DeepSeek-R1 671B Inference

This tutorial guides you through running inference on the DeepSeek-R1 671B model.

Create an Instance with DeepSeek R1 Image

To get started with DeepSeek R1 on HPC-AI.com, follow the instructions here to create a new instance with the DeepSeek R1 image. When selecting an image, navigate to the Advanced Images section and choose DeepSeek-R1-671B-Inference(Online Serving), then continue with the remaining steps to create the instance.

Note: H200 GPUs are recommended for DeepSeek R1, as the full model can be deployed on a single node. If you want to use H100 or other GPUs, you will need to run multiple nodes to load the model (see the sketch below).
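As a rough illustration of the multi-node path (a hedged sketch based on vLLM's Ray-based multi-node serving; the node count, the placeholder IP, and the parallel sizes are illustrative assumptions, not tested values for this platform):

# On the head node: start a Ray cluster.
ray start --head --port=6379

# On every worker node: join the cluster (<head-ip> is the head node's address).
ray start --address=<head-ip>:6379

# Back on the head node: shard the model across 2 nodes of 8 GPUs each,
# tensor-parallel within a node and pipeline-parallel across nodes.
vllm serve /root/commonData/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --trust-remote-code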

Access the Jupyter Notebook

Once the instance is running, click the Jupyter Notebook link to open the Jupyter interface. Open a terminal there and launch the vLLM inference server:

vllm serve /root/commonData/DeepSeek-R1 \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --tensor-parallel-size 8 \
  --load-format auto \
  --trust-remote-code \
  --served-model-name deepseek-ai/DeepSeek-R1

Loading the model takes a while; please be patient.
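To check whether the server has finished loading, you can poll the OpenAI-compatible model list endpoint from another terminal (a minimal check; 8000 matches the --port value above):

curl "http://localhost:8000/v1/models"

The connection is refused while the server is still starting; once the model is loaded, this returns a JSON object listing deepseek-ai/DeepSeek-R1.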

Inference

After the model is loaded, you can start making inference requests.

Open a new terminal in the Jupyter interface and send a request with curl:

curl "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Write a haiku that explains the concept of recursion."
}
]
}'
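With the reasoning parser enabled, vLLM separates the model's chain of thought into a reasoning_content field alongside the final content in the response. Since R1's reasoning traces can be long, you may prefer to stream tokens as they are generated using the standard OpenAI-compatible "stream" field (a sketch against the same endpoint):

curl "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Write a haiku that explains the concept of recursion."}
    ]
  }'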

You can also send requests from your local machine by replacing localhost with the instance IP address. You can find the instance IP address in Quick Tools -> Http Ports.
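For example (a sketch; <instance-ip> is a placeholder for the address shown under Http Ports, and the externally exposed port may differ from 8000 depending on how the port is mapped):

curl "http://<instance-ip>:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'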