
Distributed Deployment of DeepSeek-V3 (671B) on H200 Clusters: Cloud Inference for Billion-Parameter Models

HPC-AI delivers stable, efficient, high-performance computing resources, so you can quickly deploy and publish dedicated large-model services on our cloud platform.

Step 0: Prepare Your Model

Run custom models by uploading them to your instance or shared storage. [How to upload data?]

Example: Mounting DeepSeek-V3 via Shared Storage

  1. Create shared storage (up to 1000GB; the DeepSeek-V3 FP8 checkpoint alone occupies roughly 700GB) for multi-machine data access. [Custom storage capacity?]
  2. Launch an instance in the same region as your shared storage and mount it during configuration.
  3. Access the storage directory via Jupyter/SSH and download the model using Hugging Face CLI:
    pip install huggingface_hub
    huggingface-cli download --resume-download deepseek-ai/DeepSeek-V3 --local-dir ./models/DeepSeek-V3
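
Since a silently truncated shard will only surface later as a load error, it is worth sanity-checking the download before launching instances. A minimal sketch, using the ./models/DeepSeek-V3 path from the command above:

    # Total size on disk; expect roughly 700GB for the full FP8 checkpoint
    du -sh ./models/DeepSeek-V3
    # The checkpoint is sharded; count the shards and check for the index file
    ls ./models/DeepSeek-V3/*.safetensors | wc -l
    ls ./models/DeepSeek-V3/model.safetensors.index.json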

Step 1: Configure Your Instance

  1. Launch two H200 x 8 instances in your shared storage region and mount the storage.
  2. Configure the environment to leverage high-performance cluster features (InfiniBand/IBGDA):
    export NCCL_IB_GID_INDEX=3       # GID index used for RoCE v2 addressing
    export NCCL_SOCKET_IFNAME=eth0   # interface NCCL uses for bootstrap traffic
    export NCCL_IB_DISABLE=0         # keep the InfiniBand transport enabled
    # Optional debugging:
    # export CUDA_LAUNCH_BLOCKING=1
    # export NCCL_DEBUG=INFO
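
Before moving on, it can save debugging time to confirm that both nodes actually see the InfiniBand fabric and each other. A quick sanity check, assuming the ibstat utility (from infiniband-diags) is available on the image:

    # On each node: IB ports should report State: Active
    ibstat | grep -E "State|Rate"
    # Show how GPUs and NICs are interconnected (NVLink/PIX/PHB)
    nvidia-smi topo -m
    # Confirm eth0 carries the address you will pass as MASTER_IP
    ip -br addr show eth0
    # From the worker node: confirm the master is reachable
    ping -c 3 <MASTER_IP>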

Step 2: Launch Distributed Inference with SGLang

Use SGLang for distributed inference across nodes.

  1. Environment Setup

    pip install "sglang[all]>=0.4.8"
    sudo apt update && sudo apt install -y libnuma1   # sglang's kernels require libnuma
    export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH   # so libnuma.so.1 is found at runtime
  2. Multi-Node Launch
    Install network tools and get the master node's IP:

    sudo apt-get install -y net-tools
    ifconfig # Note the eth0 inet address

    Run the following on the respective nodes. Both commands use --tp 16 (tensor parallelism across all 16 GPUs) and point --dist-init-addr at the master's eth0 address noted above:

    # Master Node (Node 0)
    python3 -m sglang.launch_server --model-path <your_model_path> --tp 16 \
    --dist-init-addr <MASTER_IP>:50000 --nnodes 2 --node-rank 0 \
    --port 30000 --trust-remote-code

    # Worker Node (Node 1)
    python3 -m sglang.launch_server --model-path <your_model_path> --tp 16 \
    --dist-init-addr <MASTER_IP>:50000 --nnodes 2 --node-rank 1 \
    --port 30000 --trust-remote-code
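
Loading a 671B-parameter checkpoint takes several minutes, so the server will not respond immediately after launch. A minimal readiness check, assuming the /health endpoint that recent SGLang versions expose on the serving port:

    # Poll the master node until the server answers; lengthen the loop if weights load slowly
    until curl -sf http://<MASTER_IP>:30000/health > /dev/null; do
        echo "waiting for the sglang server..."
        sleep 15
    done
    echo "server is ready"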

Step 3: Publish, Manage & Monitor Your Service

  1. Publish your service by configuring instance startup options and HTTP ports.
  2. Test the service:
    curl -s http://[HttpPortsAddress]/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "DeepSeek-V3",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
    }'
  3. Terminate the service (repeat on both nodes):
    ps aux | grep sglang.launch_server
    kill <PID>   # try SIGTERM first; use kill -9 <PID> only if the process does not exit
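
If you prefer not to hunt for PIDs by hand, a minimal sketch that stops the server on a node in one shot (assumes pkill from the standard procps package):

    # Gracefully stop any SGLang server processes on this node
    pkill -f sglang.launch_server             # sends SIGTERM by default
    sleep 10                                  # give the server time to shut down cleanly
    pkill -9 -f sglang.launch_server || true  # force-kill any stragglers; no-op if none remain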