Accelerating AI with NVIDIA B200: A Complete Quickstart Guide on HPC-AI.com

At HPC-AI.COM, we are excited to fully support the NVIDIA B200, empowering users with next-generation performance for AI and HPC workloads. In this blog, we will show you how to get started with the B200 on our platform and make the most of its power.

Getting Started: Launching a B200 Instance with CUDA 12.8+

The Blackwell architecture requires CUDA 12.8 or later, so make sure your software stack (e.g. PyTorch, vLLM, SGLang) is built against a compatible CUDA version. Here we select a base image with conda pre-installed, which lets you easily manage dependencies and environments.

Set Up Your Conda Environment

To avoid conflicts between multiple applications running in the same container, we recommend creating a dedicated conda environment. This ensures a clean and efficient workspace:

# the following tutorial will be using vllm/sglang
# you may choose either one here
$ conda create -n vllm python=3.10
$ conda activate vllm

You can choose to work with either vLLM or SGLang, depending on your use case.

Download the Model

Next, you will need to download your model to the server. You can either use local storage or attach external storage to share between instances (check https://hpc-ai.com/doc/docs/storage/ for details). Keep in mind that the home directory provides limited space (50GB) and cannot be shared across multiple instances.
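
Before pulling multi-gigabyte weights, it is worth confirming that the target disk has enough room. Here is a small standard-library check; the fallback path assumes the $HOME/dataDisk mount used in the download step below, so adjust it to your storage:

import os
import shutil

# Check free space on the download target before pulling large weights.
target = os.environ.get("YOUR_DOWNLOAD_PATH", os.path.expanduser("~/dataDisk"))
usage = shutil.disk_usage(target)
print(f"free: {usage.free / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB")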

For this guide, we’ll use the Qwen/Qwen3-8B model as an example:

$ pip install -U "huggingface_hub[cli]"
$ export YOUR_DOWNLOAD_PATH=$HOME/dataDisk  # replace with local or attached storage
$ hf download Qwen/Qwen3-8B --local-dir $YOUR_DOWNLOAD_PATH/Qwen/Qwen3-8B
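
If you prefer to script the download instead of using the CLI, huggingface_hub exposes the same operation as a function. A minimal sketch mirroring the command above:

import os
from huggingface_hub import snapshot_download

# Same download as the `hf download` command above.
download_root = os.environ["YOUR_DOWNLOAD_PATH"]
snapshot_download(
    repo_id="Qwen/Qwen3-8B",
    local_dir=os.path.join(download_root, "Qwen/Qwen3-8B"),
)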

Serving Your Model: Framework-Specific Setup

Once the model is downloaded, the next step is to serve it using the appropriate framework. Below are instructions for vLLM and SGLang.

vLLM Framework

Environment Setup for vLLM

While your model downloads, you can begin installing the necessary dependencies for vLLM. Note that the installation process may vary depending on your package manager. For the most accurate instructions, always refer to the vLLM installation guide.

Install vLLM against CUDA 12.8, then verify the bundled PyTorch build:

# Install vLLM with CUDA 12.8.
$ pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128

# Check versions of important dependencies such as PyTorch
$ python -c "import torch; print(torch.__version__)"
# 2.8.0+cu128
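
You can also confirm that the GPU is visible and that the CUDA 12.8 requirement mentioned earlier is satisfied. A quick sanity check to run in a Python shell; the values in the comments are what a B200 with the cu128 build should report:

import torch

# The cu128 wheel should report CUDA 12.8 and see the Blackwell GPU.
print(torch.version.cuda)                   # 12.8
print(torch.cuda.is_available())            # True
print(torch.cuda.get_device_name(0))        # NVIDIA B200
print(torch.cuda.get_device_capability(0))  # (10, 0): Blackwell is sm_100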

Example request

# Launch server
$ cd $YOUR_DOWNLOAD_PATH && vllm serve Qwen/Qwen3-8B

# Send an HTTP request
$ curl -X 'POST' \
    'http://0.0.0.0:8000/v1/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "Qwen/Qwen3-8B",
        "prompt": "vLLM is a framework that ",
        "max_tokens": 128
    }'
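
Since vLLM serves an OpenAI-compatible API, you can send the same request from Python with the openai client. A minimal sketch; the api_key value is a placeholder, as the server does not check it by default:

# pip install openai (if not already present)
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="Qwen/Qwen3-8B",
    prompt="vLLM is a framework that ",
    max_tokens=128,
)
print(completion.choices[0].text)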

SGLang Framework

Environment Setup for SGLang

Similarly, you can install the necessary dependencies for SGLang while the model is downloading. For the most up-to-date installation instructions, visit the official SGLang documentation.

Install SGLang, then confirm the installation by checking the versions of key dependencies:

# Install SGLang
$ pip install "sglang[all]>=0.5.3rc0"

# Check versions of important packages
$ pip freeze | grep -E "sglang|torch"
# e.g. sglang==0.5.3rc0
#      torch==2.8.0+cu128

Example request

# Launch server
$ cd $YOUR_DOWNLOAD_PATH && python -m sglang.launch_server --model-path Qwen/Qwen3-8B

# Send an HTTP request
$ curl -X 'POST' \
    'http://0.0.0.0:30000/v1/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "Qwen/Qwen3-8B",
        "prompt": "SGLang is a framework that ",
        "max_tokens": 128
    }'
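
SGLang also provides an offline engine API if you would rather generate without launching an HTTP server. A minimal sketch, assuming the model path from the download step; check the SGLang docs for the exact interface of your version:

import os
import sglang as sgl

# Offline generation: no HTTP server required.
model_path = os.path.join(os.environ["YOUR_DOWNLOAD_PATH"], "Qwen/Qwen3-8B")
llm = sgl.Engine(model_path=model_path)

outputs = llm.generate(
    ["SGLang is a framework that "],
    {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 128},
)
print(outputs[0]["text"])
llm.shutdown()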

TensorRT-LLM with FP4 Precision

Beyond raw compute gains over the H200, the B200 supports FP4 low-precision inference, delivering faster, more memory-efficient model execution and unlocking new possibilities for AI workloads.

Let's walk through a quick example so you can experience the power of FP4.

Environment Setup for TensorRT-LLM

To run inference with FP4 precision, ensure your system has Python 3.10 and a CUDA driver supporting CUDA 12.8 or higher. Then install the necessary libraries:

# make sure python==3.10, CUDA driver >= 12.8
$ pip3 install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

$ pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm

$ sudo apt-get -y install libopenmpi-dev
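
A quick import check confirms that TensorRT-LLM installed cleanly before moving on; the version printed will depend on the release you installed:

import tensorrt_llm

# Importing succeeds only if the wheel matches your CUDA setup.
print(tensorrt_llm.__version__)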

Download the FP4 Model Weights

Download the FP4-optimized model weights from Hugging Face:

$ hf download nvidia/Qwen3-8B-FP4 --local-dir $YOUR_DOWNLOAD_PATH/Qwen3-8B-FP4

Example script:

import os

from tensorrt_llm import LLM, SamplingParams


def main():

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Shell variables are not expanded inside Python strings, so resolve the path explicitly.
    llm = LLM(model=os.path.join(os.environ["YOUR_DOWNLOAD_PATH"], "Qwen3-8B-FP4"))

    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


# The entry point of the program needs to be protected for spawning processes.
if __name__ == '__main__':
    main()

Performance Comparison on 1x B200 GPU

We benchmarked both the FP4-optimized and the standard version of Qwen3-8B to compare single-prompt generation times.

Model        | Prompt                                 | Generated Text                                                                                                                                       | Generation Time (s) | Speed Up
Qwen3-8B-FP4 | Write a short piece about Harry Potter | Harry Potter, the boy who lived, had always known he was different. From the moment he was left on the Dursleys' doorstep, he had been              | 0.1291              | 1.26x
Qwen3-8B     | Write a short piece about Harry Potter | Harry Potter, the boy who lived, had always felt like an outsider in the wizarding world. Despite his extraordinary abilities, he often felt invisible, overshadowed | 0.1639 | 1.0x
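
For reference, timings like the ones above can be reproduced by wrapping generate calls in a wall-clock timer. A minimal sketch, assuming the same model paths as earlier; the warm-up call excludes engine build time, and absolute numbers will vary with hardware and settings:

import os
import time

from tensorrt_llm import LLM, SamplingParams


def time_generation(model_path: str, prompt: str) -> float:
    """Return the wall-clock time of a single generate call."""
    llm = LLM(model=model_path)
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
    llm.generate([prompt], params)  # warm-up: exclude engine build from timing
    start = time.perf_counter()
    llm.generate([prompt], params)
    return time.perf_counter() - start


if __name__ == "__main__":
    root = os.environ["YOUR_DOWNLOAD_PATH"]
    prompt = "Write a short piece about Harry Potter"
    # Run each model in a separate process if GPU memory gets tight.
    fp4 = time_generation(os.path.join(root, "Qwen3-8B-FP4"), prompt)
    std = time_generation(os.path.join(root, "Qwen/Qwen3-8B"), prompt)
    print(f"FP4: {fp4:.4f}s, standard: {std:.4f}s, speedup: {std / fp4:.2f}x")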

Conclusion: Powering the Future of AI with B200

The NVIDIA B200 GPU delivers a transformative leap in performance, particularly for large-scale AI applications. Whether you are fine-tuning large language models or running inference-heavy workloads, the B200 is built for speed and scalability.

From faster model serving to optimized inference times, the B200 is the ideal choice for organizations seeking to accelerate their AI development pipeline. Start deploying B200 instances on HPC-AI.com today and experience the next frontier of AI performance.
