
GPT-OSS-120B Deployment Guide on HPC-AI.COM

We are excited to announce the open-source release of GPT-OSS-120B, a large language model built for production-grade applications, general-purpose tasks, and advanced reasoning. At HPC-AI.COM, you can deploy this model with unmatched efficiency and convenience, powered by our high-performance infrastructure.

1. Environment Setup

1.1 Pre-configured Development Environment

We provide ready-to-use high-performance development environments. Simply select a pre-built image (e.g., CUDA 12.8) to launch a full-featured GPU cloud instance.


1.2 Install Your Preferred Inference Framework

After launching your instance, install the inference framework that best suits your needs.


For example, to set up a basic vLLM environment:

pip install uv
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
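
Before moving on, a quick sanity check (a minimal sketch; it assumes only the environment installed above) confirms that vLLM imports cleanly and all GPUs are visible:

import torch
import vllm

# Confirm the preview wheel installed and CUDA is usable before serving.
print(f"vLLM version: {vllm.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPUs visible: {torch.cuda.device_count()}")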

2. Model Privatization

We provide cluster-level caching to load models at exceptional speed. For instance, the entire 183 GB GPT-OSS-120B model can be downloaded in under 5 minutes, with speeds up to 1.1 GB/s (at peak, 183 GB ÷ 1.1 GB/s ≈ 170 s, just under 3 minutes).

We also configure data disks and high-speed shared storage so you can easily access the model in a fully isolated environment.


Example: Privatized Model Download

#!/bin/bash
# Download the privatized model from the cluster-local MinIO cache.
export model="openai/gpt-oss-120b"
cd ${YourModelPath}

# Install the JuiceFS client, then sync the model weights into the current directory.
curl -sSL https://d.juicefs.com/install | sh -
juicefs sync minio://minio:minio123@minio:9000/hf-model/${model}/ ./${model}/
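
Once the sync finishes, you can confirm the download is complete by totaling the file sizes on disk. This is a minimal sketch; MODEL_DIR is an assumption and should point at the directory you synced into:

import os

# Assumption: the model was synced into this directory (relative to ${YourModelPath}).
MODEL_DIR = "openai/gpt-oss-120b"

total = 0
for root, _, files in os.walk(MODEL_DIR):
    for name in files:
        total += os.path.getsize(os.path.join(root, name))

# GPT-OSS-120B should total roughly 183 GB.
print(f"{total / 1e9:.1f} GB on disk")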

3. Public Inference Service

We support public HTTP forwarding so you can expose your inference endpoint to the internet.


Start a vLLM server on 8 high-performance GPUs using:

vllm serve ${modelPath} --tensor-parallel-size 8

📌 Note: vLLM uses the path passed to vllm serve as the model name, so the model field in every request must use exactly the same value.

Sample Forwarding Address:


https://notebook-95a61cb8-7296-11f0-882e-8adcb39cb4eb-8000.na-usa-1.hpc-ai.com

Test with curl

# Replace ${modelPath} with the exact path you passed to vllm serve
# (shell variables do not expand inside single-quoted JSON).
curl -s https://notebook-95a61cb8-7296-11f0-882e-8adcb39cb4eb-8000.na-usa-1.hpc-ai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "${modelPath}",
    "messages": [{"role": "user", "content": "Hello, please introduce yourself"}],
    "max_tokens": 100
  }'
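
The same smoke test can be run with the official openai Python client. This is a minimal sketch; the model path is an assumption and must match the path you passed to vllm serve:

from openai import OpenAI

client = OpenAI(
    base_url="https://notebook-95a61cb8-7296-11f0-882e-8adcb39cb4eb-8000.na-usa-1.hpc-ai.com/v1",
    api_key="EMPTY",  # vLLM does not enforce an API key by default
)

resp = client.chat.completions.create(
    model="/root/dataDisk/openai/gpt-oss-120b/",  # assumption: the path passed to vllm serve
    messages=[{"role": "user", "content": "Hello, please introduce yourself"}],
    max_tokens=100,
)
print(resp.choices[0].message.content)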

HPC-AI.COM Performance Demo

✅ Complex Task Inference:

Single-machine throughput: 200+ tokens/s on tasks like BIG-Bench Hard – Web of Lies.

Sample Request:

curl -X POST "https://notebook-95a61cb8-7296-11f0-882e-8adcb39cb4eb-8000.na-usa-1.hpc-ai.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "${modelPath}",
    "messages": [{
      "role": "user",
      "content": "Solve this logic puzzle: There are 4 people - Alice, Bob, Charlie, Diana..."
    }],
    "max_tokens": 8192
  }'

Performance Logs

Avg generation throughput: 249.9 tokens/s
Prefix cache hit rate: 39.1%
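
The server logs above are the authoritative source for throughput, but you can get a rough client-side estimate by timing a streamed response. A minimal sketch, assuming the server runs locally and the model path matches the one served:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start, tokens = time.time(), 0
stream = client.chat.completions.create(
    model="/root/dataDisk/openai/gpt-oss-120b/",  # assumption: the path passed to vllm serve
    messages=[{"role": "user", "content": "Count from 1 to 50."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1  # roughly one token per streamed chunk

print(f"~{tokens / (time.time() - start):.0f} tokens/s (client-side estimate)")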

✅ Tool Use + Streaming Inference:

Tool-assisted generation at 160 tokens/s

Request:

curl -X POST "https://notebook-95a61cb8-7296-11f0-882e-8adcb39cb4eb-8000.na-usa-1.hpc-ai.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "/root/dataDisk/openai/gpt-oss-120b/",
    "stream": true,
    "messages": [{
      "role": "user",
      "content": "I need to analyze NVIDIA stock. Please search for recent earnings news and get the current stock price."
    }],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "web_search",
          "description": "Search for information",
          "parameters": {
            "type": "object",
            "properties": {
              "query": { "type": "string" }
            },
            "required": ["query"]
          }
        }
      },
      {
        "type": "function",
        "function": {
          "name": "get_stock_price",
          "description": "Get stock price",
          "parameters": {
            "type": "object",
            "properties": {
              "symbol": { "type": "string" }
            },
            "required": ["symbol"]
          }
        }
      }
    ],
    "max_tokens": 4096
  }'
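
The curl request above only covers the first turn. The sketch below (assumptions: a local vLLM server on port 8000 and hypothetical stub implementations of both tools) shows the full round trip: run the tool calls the model returns, append the results as tool messages, and ask again for the final answer:

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "/root/dataDisk/openai/gpt-oss-120b/"

# Hypothetical stubs; a real agent would call live search and market-data APIs.
def web_search(query: str) -> str:
    return f"Top headlines for '{query}': NVIDIA reports strong quarterly earnings."

def get_stock_price(symbol: str) -> str:
    return f"{symbol} last traded at 123.45 USD."

TOOL_IMPLS = {"web_search": web_search, "get_stock_price": get_stock_price}

# Same tool schema as the curl request above.
tools = [
    {"type": "function", "function": {
        "name": "web_search", "description": "Search for information",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "get_stock_price", "description": "Get stock price",
        "parameters": {"type": "object",
                       "properties": {"symbol": {"type": "string"}},
                       "required": ["symbol"]}}},
]

messages = [{"role": "user", "content": "I need to analyze NVIDIA stock. "
             "Please search for recent earnings news and get the current stock price."}]

# First turn: the model decides which tools to call.
resp = client.chat.completions.create(model=MODEL, messages=messages,
                                      tools=tools, max_tokens=4096)
msg = resp.choices[0].message
messages.append(msg)

# Execute each requested tool call and feed the result back as a tool message.
for call in msg.tool_calls or []:
    args = json.loads(call.function.arguments)
    result = TOOL_IMPLS[call.function.name](**args)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

# Second turn: the model composes its final answer from the tool outputs.
final = client.chat.completions.create(model=MODEL, messages=messages, max_tokens=4096)
print(final.choices[0].message.content)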

Agentic Inference with the OpenAI Agents SDK

You can use the OpenAI Agents SDK to build tool-using agents with GPT-OSS:

Example Code

import asyncio
from openai import AsyncOpenAI
from agents import Agent, Runner, function_tool, OpenAIResponsesModel, set_tracing_disabled

set_tracing_disabled(True)

@function_tool
def get_weather(city: str):
    print(f"[debug] getting weather for {city}")
    return f"The weather in {city} is sunny."

async def main():
    agent = Agent(
        name="Assistant",
        instructions="You only respond in haikus.",
        # Point the Agents SDK at the local vLLM endpoint serving GPT-OSS-120B.
        model=OpenAIResponsesModel(
            model="/root/dataDisk/openai/gpt-oss-120b/",
            openai_client=AsyncOpenAI(
                base_url="http://localhost:8000/v1",
                api_key="EMPTY",
            ),
        ),
        tools=[get_weather],
    )

    result = await Runner.run(agent, "What's the weather in Tokyo?")
    print(result.final_output)

if __name__ == "__main__":
    asyncio.run(main())

Output Example

[debug] getting weather for Tokyo
Tokyo sun glows bright
Cherry blossoms smile anew
Gentle warmth embraces