
Explore Reinforcement Fine-Tuning

Reinforcement Fine-Tuning (RFT) is a post-training technique that leverages reinforcement learning to align large language models with human preferences, enhancing their performance on complex, open-ended tasks. This guide walks you through how to run an RFT job on the HPC-AI.com platform using built-in large language models.

Step 1: Go to the RFT Job Page

  1. Log in to HPC-AI.com.
  2. From the left sidebar, click Jobs, then select Run RL Job.

rl-job.jpg

Step 2: Configure Your Job

Fill in the required fields:

  • Job Name: Enter a name for your job.
  • Resource Configuration:
    • GPU Type: Choose from H100 or H200 GPUs, with B200 coming soon.
    • GPU Region: Select your preferred compute region, such as Singapore or United States.
    • Remote Storage: Select a remote storage in the same region as your GPU to read/write training data and models. If no remote storage exists for that region, you will need to create one.

rl-job-resource.jpg

Step 3: Upload Training Data

Upload your training data using the upload box below.

rl-job-data.jpg

Example Format for Training Data

The following example illustrates how training data should be structured. We accept JSONL format, with each line containing a JSON object of the following form:

{
  "messages": {
    "role": "user",
    "content": "Let \\[f(x) = \\left\\{\n\\begin{array}{cl} ax+3, &\\text{ if }x>2, \\\\\nx-5 &\\text{ if } -2 \\le x \\le 2, \\\\\n2x-b &\\text{ if } x <-2.\n\\end{array}\n\\right.\\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)."
  },
  "gt_answer": "0"
}
  • content: The question to train on, typically a math problem.
  • gt_answer: The ground-truth answer for that question.
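
As a minimal sketch, the snippet below shows one way to assemble a training file in this format using Python's standard library. The field names (messages, role, content, gt_answer) follow the example above; the sample questions and the output filename train.jsonl are placeholders.

```python
import json

# Placeholder question/answer pairs; replace with your own data.
samples = [
    {"question": "What is 7 * 8?", "gt_answer": "56"},
    {"question": "Solve for x: 2x + 3 = 11.", "gt_answer": "4"},
]

# Write one JSON object per line (JSONL), matching the structure shown above.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        record = {
            "messages": {"role": "user", "content": sample["question"]},
            "gt_answer": sample["gt_answer"],
        }
        f.write(json.dumps(record) + "\n")
```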

You can also click Example Jsonl File to see the recommended format.

Step 4: Select Model Template

Choose from the built-in model templates. Currently supported:

  • Qwen 2.5 - 3B
  • Qwen 2.5 - 7B
  • Qwen 2.5 - 14B

rl-job-model.jpg

All templates use the GRPO algorithm for reinforcement fine-tuning. GRPO (Group Relative Policy Optimization), proposed by DeepSeek, offers advantages such as faster convergence, improved alignment with human preferences, and reduced training variance compared to traditional reinforcement learning methods. To learn more about GRPO, see this guide.
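
For intuition on how GRPO reduces training variance, the sketch below computes group-relative advantages: several responses are sampled for the same prompt, and each response's reward is normalized against the group's mean and standard deviation, so no separate critic model is needed to estimate a baseline. This is only an illustration of the core idea, not the platform's training code; the reward values are placeholders.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards for 4 sampled answers to one prompt
# (1.0 = matches the ground-truth answer, 0.0 = does not).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```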

For inquiries about additional templates, please reach out to us at service@hpc-ai.com.

Step 5: Enable Weights & Biases Tracking

  • Enter your Weights & Biases API key to enable tracking.
  • This allows you to monitor GPU-level metrics such as:
    • GPU frequency
    • Utilization
    • I/O performance

rl-job-wandb.jpg

Step 6: Submit and Monitor Your Job

  • Click Submit to start the job.
  • Monitor the job status under Job Status.
  • Once the status changes to Succeeded, your fine-tuned model is ready and can be found in your remote storage.

rl-job-succeed.jpg

rl-job-storage.jpg