
GPT-OSS-120B Deployment Guide on HPC-AI.COM

We are excited to announce the open-source release of GPT-OSS-120B, a large language model built for production-grade applications, general-purpose tasks, and advanced reasoning. At HPC-AI.COM, you can deploy this model with unmatched efficiency and convenience, powered by our high-performance infrastructure.

1. Environment Setup

1.1 Pre-configured Development Environment

We provide ready-to-use high-performance development environments. Simply select a pre-built image (e.g., CUDA 12.8) to launch a full-featured GPU cloud instance.


1.2 Install Your Preferred Inference Framework

After launching your instance, install the inference framework that best suits your needs.


For example, to set up a basic vLLM environment:

pip install uv
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
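
Before moving on, a quick sanity check (a minimal sketch; it assumes only the environment installed above) confirms that vLLM imports cleanly and all GPUs are visible:

import torch
import vllm

# Confirm the preview wheel installed and CUDA is usable before serving.
print(f"vLLM version: {vllm.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPUs visible: {torch.cuda.device_count()}")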

2. Model Privatization

We provide cluster-level caching to load models at exceptional speed. For instance, the entire 183 GB GPT-OSS-120B model can be downloaded in under 5 minutes, with speeds up to 1.1 GB/s (at peak, 183 GB ÷ 1.1 GB/s ≈ 170 s, just under 3 minutes).

We also configure data disks and high-speed shared storage so you can easily access the model in a fully isolated environment.


Example: Privatized Model Download

#!/bin/bash
# Download the privatized model from the cluster-local MinIO cache.
export model="openai/gpt-oss-120b"
cd ${YourModelPath}

# Install the JuiceFS client, then sync the model weights into the current directory.
curl -sSL https://d.juicefs.com/install | sh -
juicefs sync minio://minio:minio123@minio:9000/hf-model/${model}/ ./${model}/
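
Once the sync finishes, you can confirm the download is complete by totaling the file sizes on disk. This is a minimal sketch; MODEL_DIR is an assumption and should point at the directory you synced into:

import os

# Assumption: the model was synced into this directory (relative to ${YourModelPath}).
MODEL_DIR = "openai/gpt-oss-120b"

total = 0
for root, _, files in os.walk(MODEL_DIR):
    for name in files:
        total += os.path.getsize(os.path.join(root, name))

# GPT-OSS-120B should total roughly 183 GB.
print(f"{total / 1e9:.1f} GB on disk")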

3. Public Inference Service

We support public HTTP forwarding so you can expose your inference endpoint to the internet.


Start a vLLM server on 8 high-performance GPUs using:

vllm serve ${modelPath} --tensor-parallel-size 8

📌 Note: vLLM uses the path passed to vllm serve as the model name, so the model field in every request must use exactly the same value.

Sample Forwarding Address:


https://notebook-95a61cb8-7296-11f0-882e-8adcb39cb4eb-8000.na-usa-1.hpc-ai.com

Test with curl

# Replace ${modelPath} with the exact path you passed to vllm serve
# (shell variables do not expand inside single-quoted JSON).
curl -s https://notebook-95a61cb8-7296-11f0-882e-8adcb39cb4eb-8000.na-usa-1.hpc-ai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "${modelPath}",
    "messages": [{"role": "user", "content": "Hello, please introduce yourself"}],
    "max_tokens": 100
  }'
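
The same smoke test can be run with the official openai Python client. This is a minimal sketch; the model path is an assumption and must match the path you passed to vllm serve:

from openai import OpenAI

client = OpenAI(
    base_url="https://notebook-95a61cb8-7296-11f0-882e-8adcb39cb4eb-8000.na-usa-1.hpc-ai.com/v1",
    api_key="EMPTY",  # vLLM does not enforce an API key by default
)

resp = client.chat.completions.create(
    model="/root/dataDisk/openai/gpt-oss-120b/",  # assumption: the path passed to vllm serve
    messages=[{"role": "user", "content": "Hello, please introduce yourself"}],
    max_tokens=100,
)
print(resp.choices[0].message.content)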

HPC-AI.COM Performance Demo

✅ Complex Task Inference:

Single-machine throughput: 200+ tokens/s on tasks like BIG-Bench Hard – Web of Lies.

Sample Request:

curl -X POST "https://notebook-95a61cb8-7296-11f0-882e-8adcb39cb4eb-8000.na-usa-1.hpc-ai.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "${modelPath}",
    "messages": [{
      "role": "user",
      "content": "Solve this logic puzzle: There are 4 people - Alice, Bob, Charlie, Diana..."
    }],
    "max_tokens": 8192
  }'

Performance Logs

Avg generation throughput: 249.9 tokens/s
Prefix cache hit rate: 39.1%
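
The server logs above are the authoritative source for throughput, but you can get a rough client-side estimate by timing a streamed response. A minimal sketch, assuming the server runs locally and the model path matches the one served:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start, tokens = time.time(), 0
stream = client.chat.completions.create(
    model="/root/dataDisk/openai/gpt-oss-120b/",  # assumption: the path passed to vllm serve
    messages=[{"role": "user", "content": "Count from 1 to 50."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1  # roughly one token per streamed chunk

print(f"~{tokens / (time.time() - start):.0f} tokens/s (client-side estimate)")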

✅ Tool Use + Streaming Inference:

Tool-assisted generation at 160 tokens/s

Request:

curl -X POST "https://notebook-95a61cb8-7296-11f0-882e-8adcb39cb4eb-8000.na-usa-1.hpc-ai.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "/root/dataDisk/openai/gpt-oss-120b/",
    "stream": true,
    "messages": [{
      "role": "user",
      "content": "I need to analyze NVIDIA stock. Please search for recent earnings news and get the current stock price."
    }],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "web_search",
          "description": "Search for information",
          "parameters": {
            "type": "object",
            "properties": {
              "query": { "type": "string" }
            },
            "required": ["query"]
          }
        }
      },
      {
        "type": "function",
        "function": {
          "name": "get_stock_price",
          "description": "Get stock price",
          "parameters": {
            "type": "object",
            "properties": {
              "symbol": { "type": "string" }
            },
            "required": ["symbol"]
          }
        }
      }
    ],
    "max_tokens": 4096
  }'
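
The curl request above only covers the first turn. The sketch below (assumptions: a local vLLM server on port 8000 and hypothetical stub implementations of both tools) shows the full round trip: run the tool calls the model returns, append the results as tool messages, and ask again for the final answer:

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "/root/dataDisk/openai/gpt-oss-120b/"

# Hypothetical stubs; a real agent would call live search and market-data APIs.
def web_search(query: str) -> str:
    return f"Top headlines for '{query}': NVIDIA reports strong quarterly earnings."

def get_stock_price(symbol: str) -> str:
    return f"{symbol} last traded at 123.45 USD."

TOOL_IMPLS = {"web_search": web_search, "get_stock_price": get_stock_price}

# Same tool schema as the curl request above.
tools = [
    {"type": "function", "function": {
        "name": "web_search", "description": "Search for information",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "get_stock_price", "description": "Get stock price",
        "parameters": {"type": "object",
                       "properties": {"symbol": {"type": "string"}},
                       "required": ["symbol"]}}},
]

messages = [{"role": "user", "content": "I need to analyze NVIDIA stock. "
             "Please search for recent earnings news and get the current stock price."}]

# First turn: the model decides which tools to call.
resp = client.chat.completions.create(model=MODEL, messages=messages,
                                      tools=tools, max_tokens=4096)
msg = resp.choices[0].message
messages.append(msg)

# Execute each requested tool call and feed the result back as a tool message.
for call in msg.tool_calls or []:
    args = json.loads(call.function.arguments)
    result = TOOL_IMPLS[call.function.name](**args)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

# Second turn: the model composes its final answer from the tool outputs.
final = client.chat.completions.create(model=MODEL, messages=messages, max_tokens=4096)
print(final.choices[0].message.content)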

Agentic Inference with the OpenAI Agents SDK

You can use the OpenAI Agents SDK to build tool-using agents with GPT-OSS:

Example Code

import asyncio
from openai import AsyncOpenAI
from agents import Agent, Runner, function_tool, OpenAIResponsesModel, set_tracing_disabled

set_tracing_disabled(True)

@function_tool
def get_weather(city: str):
    print(f"[debug] getting weather for {city}")
    return f"The weather in {city} is sunny."

async def main():
    agent = Agent(
        name="Assistant",
        instructions="You only respond in haikus.",
        # Point the Agents SDK at the local vLLM endpoint serving GPT-OSS-120B.
        model=OpenAIResponsesModel(
            model="/root/dataDisk/openai/gpt-oss-120b/",
            openai_client=AsyncOpenAI(
                base_url="http://localhost:8000/v1",
                api_key="EMPTY",
            ),
        ),
        tools=[get_weather],
    )

    result = await Runner.run(agent, "What's the weather in Tokyo?")
    print(result.final_output)

if __name__ == "__main__":
    asyncio.run(main())

Output Example

[debug] getting weather for Tokyo
Tokyo sun glows bright
Cherry blossoms smile anew
Gentle warmth embraces