Explore Reinforcement Fine-Tuning
Reinforcement Fine-Tuning (RFT) is a post-training technique that leverages reinforcement learning to align large language models with human preferences, enhancing their performance on complex, open-ended tasks. This guide walks you through how to run an RFT job on the HPC-AI.com platform using built-in large language models.
Step 1: Go to the RFT Job Page
- Log in to HPC-AI.com.
- From the left sidebar, click Jobs, then select Run RL Job.
Step 2: Configure Your Job
Fill in the required fields:
- Job Name: Enter a name for your job.
- Resource Configuration:
  - GPU Type: Choose H100 or H200 GPUs; B200 is coming soon.
  - GPU Region: Select your preferred compute region, such as Singapore or the United States.
  - Remote Storage: Select remote storage in the same region as your GPU to read and write training data and models. If no remote storage exists in that region, you will need to create one first.
Step 3: Upload Training Data
Upload your training data using the upload box on the job page.
Example Format for Training Data
The following example illustrates how training data should be constructed. The platform accepts JSONL format, with each line being a JSON object of the following structure:
```json
{
  "messages": {
    "role": "user",
    "content": "Let \\[f(x) = \\left\\{\n\\begin{array}{cl} ax+3, &\\text{ if }x>2, \\\\\nx-5 &\\text{ if } -2 \\le x \\le 2, \\\\\n2x-b &\\text{ if } x <-2.\n\\end{array}\n\\right.\\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)."
  },
  "gt_answer": "0"
}
```
- content: the question text, typically a math problem (note the JSON-escaped LaTeX backslashes)
- gt_answer: the ground-truth answer used to score the model's responses

In the sample above, gt_answer is "0" because continuity at x = 2 requires 2a + 3 = -3 (so a = -3) and continuity at x = -2 requires -4 - b = -7 (so b = 3), giving a + b = 0.
You can also click Example Jsonl File to see the recommended format.
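If you assemble the file yourself, a small script along these lines can write and sanity-check it (a minimal sketch; the file name and sample questions are illustrative, and json.dumps handles the LaTeX backslash escaping for you):

```python
# Illustrative helper: write RFT training data as JSONL, one record per line.
# The record layout mirrors the example above; file name and questions are placeholders.
import json

samples = [
    {"question": r"Compute $1 + 1$.", "answer": "2"},
    {"question": r"Simplify $\frac{6}{3}$.", "answer": "2"},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        record = {
            "messages": {"role": "user", "content": s["question"]},
            "gt_answer": s["answer"],
        }
        f.write(json.dumps(record) + "\n")

# Sanity check before uploading: every line must parse and contain both fields.
with open("train.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        rec = json.loads(line)
        assert "messages" in rec and "gt_answer" in rec, f"line {i} is malformed"
```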
Step 4: Select Model Template
Choose from the built-in model templates. Currently supported:
- Qwen 2.5 - 3B
- Qwen 2.5 - 7B
- Qwen 2.5 - 14B
All templates use the GRPO algorithm for reinforcement fine-tuning. GRPO (Group Relative Policy Optimization), proposed by DeepSeek, offers advantages such as faster convergence, improved alignment with human preferences, and reduced training variance compared to traditional reinforcement learning methods. If you want to learn more about GRPO, you can visit this guide.
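As a rough illustration of the core idea (a simplified sketch, not the platform's internal implementation): for each question, GRPO samples a group of responses, scores each one with a reward (for example, 1 if the answer matches gt_answer, else 0), and converts the rewards into advantages by normalizing them within the group, which removes the need for a separate critic model.

```python
# Toy sketch of GRPO's group-relative advantage computation.
# Illustrative only; the 0/1 exact-match reward is an assumption.
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """A_i = (r_i - mean(r)) / std(r), computed within one sampled group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mu) / sigma for r in rewards]

# Four sampled answers to one question, rewarded against gt_answer "0".
sampled_answers = ["0", "3", "0", "-3"]
rewards = [1.0 if a == "0" else 0.0 for a in sampled_answers]
print(group_relative_advantages(rewards))  # correct answers get positive advantage
```

Responses that beat their group's average receive positive advantages and are reinforced; the rest are discouraged.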
For inquiries about additional templates, please reach out to us at service@hpc-ai.com.
Step 5: Provide wandb Key (Optional but Recommended)
- Enter your Weights & Biases API key to enable tracking.
- This allows you to monitor GPU-level metrics such as:
  - GPU frequency
  - Utilization
  - I/O performance
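You can find your API key at https://wandb.ai/authorize. If you want to sanity-check the key locally before pasting it into the form, a quick check (assuming the wandb Python package is installed; the key string is a placeholder) is:

```python
# Verify a W&B API key locally before submitting the job.
# Requires `pip install wandb`; replace the placeholder with your real key.
import wandb

ok = wandb.login(key="YOUR_WANDB_API_KEY", relogin=True)
print("Key accepted" if ok else "Login failed")
```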
Step 6: Submit and Monitor Your Job
- Click Submit to start the job.
- Monitor the job status under Job Status.
- Once the status changes to Succeeded, your fine-tuned model is ready; it is written to the remote storage you configured in Step 2.
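After downloading the checkpoint from your remote storage, you can try it locally. Below is a minimal inference sketch assuming a standard Hugging Face-style checkpoint layout (an assumption, not platform-confirmed); the local directory name and prompt are illustrative:

```python
# Minimal local inference sketch for the fine-tuned model.
# Assumes the checkpoint was downloaded from remote storage to ./rft-output
# and is in standard Hugging Face format (an assumption).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./rft-output"  # illustrative path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

messages = [{"role": "user", "content": "What is $1 + 1$?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```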