Colossal-RL: Democratizing Reinforcement Learning with Colossal-AI

Overview

In this section, we introduce how to run reinforcement learning with Colossal-AI to train your own RL model. We support GRPO (Group Relative Policy Optimization), the main algorithm used to train the DeepSeek R1 model. GRPO is a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning ability while reducing the memory usage of PPO.

Data

The following example illustrates how training data should be constructed. We accept JSONL format with each line having the following structure:

{
  "messages": {
    "role": "user",
    "content": "Let \\[f(x) = \\left\\{\n\\begin{array}{cl} ax+3, &\\text{ if }x>2, \\\\\nx-5 &\\text{ if } -2 \\le x \\le 2, \\\\\n2x-b &\\text{ if } x <-2.\n\\end{array}\n\\right.\\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)."
  },
  "gt_answer": "0"
}
  • content: The question text, typically a math problem
  • gt_answer: The ground-truth answer
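
As a minimal sketch, the snippet below writes a few samples into a JSONL file following this schema. The file name and the example questions are placeholders, not part of Colossal-AI.

import json

# Hypothetical records following the documented schema:
# one JSON object per line, with a "messages" entry and a "gt_answer" string.
samples = [
    {
        "messages": {"role": "user", "content": "What is 7 * 8?"},
        "gt_answer": "56",
    },
    {
        "messages": {"role": "user", "content": "Solve for x: 2x + 3 = 11."},
        "gt_answer": "4",
    },
]

# Write one JSON object per line (JSONL). The output path is a placeholder.
with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")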

Training Script

Users can adjust the training parameters to train their own model. We suggest changing the following parameters:

Parameter | Description
--- | ---
-m, --model | Local model path
-d, --dataset | Local data path
-s, --system-prompt | System prompt used to construct the dataset
-p, --project | Project name for Wandb
-g, --num-generations | Number of generations per prompt
-e, --num-episodes | Number of episodes
-lr, --learning-rate | Learning rate for GRPO
-kl, --kl-coeff | KL penalty coefficient for GRPO
-si, --save-interval | Interval for saving checkpoints
-sd, --save-dir | Directory for saving checkpoints
-mnt, --max-new-tokens | Max length for generation
-mpt, --max-prompt-tokens | Max length for the prompt
-temp, --temperature | Temperature for sampling
-topk, --top-k | Top-k for sampling
-topp, --top-p | Top-p for sampling
-ibs, --inference-batch-size | Number of prompts to generate per inference step. It should be divisible by tbs, and the weights on the inference backend will be synced every ibs/tbs training steps of the policy model
-imbs, --inference-microbatch-size | Effective batch size for the inference backend to run generation. Please select based on the memory constraint
-tbs, --train-batch-size | Number of unique prompts used to update the policy model per step per dp group. Gradients are accumulated across tbs * dp_size unique prompts, equivalently tbs * g * dp_size samples (see the worked example below the table)
-tMbs, --train-minibatch-size | Number of unique prompts in each training batch per dp group. The inference backend must generate tMbs * g * dp_size samples before forwarding. Satisfy tMbs * g >= tmbs
-tmbs, --train-microbatch-size | Effective batch size per dp group for the forward and backward pass. Please select based on the available memory
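
To make the batch-size relationships above concrete, here is a small worked sketch. The numbers are purely illustrative, not recommended settings.

# Illustrative values only; they are not recommended settings.
g = 8          # --num-generations: completions sampled per prompt
dp_size = 4    # number of data-parallel groups
tbs = 8        # --train-batch-size: unique prompts per policy update per dp group
tMbs = 4       # --train-minibatch-size: unique prompts per training batch per dp group
tmbs = 2       # --train-microbatch-size: samples per forward/backward pass per dp group
ibs = 16       # --inference-batch-size: prompts generated per inference step

# Constraints stated in the parameter table.
assert ibs % tbs == 0, "ibs must be divisible by tbs"
assert tMbs * g >= tmbs, "tMbs * g must be at least tmbs"

# Samples consumed by one policy update across all dp groups.
samples_per_update = tbs * g * dp_size        # 8 * 8 * 4 = 256

# Samples the inference backend must produce before a training forward pass.
samples_before_forward = tMbs * g * dp_size   # 4 * 8 * 4 = 128

# Inference-backend weights are synced every ibs / tbs training steps.
sync_every = ibs // tbs                       # 16 / 8 = 2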
Note: For the other parameters, we suggest keeping the default values to avoid unexpected issues. We are under intensive development and will release more features soon.
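
As a rough sketch of how the documented flags fit together, the snippet below launches a hypothetical training entry point with a subset of the parameters above. The script name train_grpo.py, the paths, and all flag values are placeholders, not verified defaults.

import subprocess

# Hypothetical launch: "train_grpo.py" is a placeholder entry point,
# and every value below is illustrative, not a recommended setting.
cmd = [
    "python", "train_grpo.py",
    "-m", "/path/to/local/model",        # --model: local model path
    "-d", "/path/to/train_data.jsonl",   # --dataset: local data path
    "-p", "my-grpo-run",                 # --project: Wandb project name
    "-g", "8",                           # --num-generations per prompt
    "-e", "1",                           # --num-episodes
    "-lr", "1e-6",                       # --learning-rate for GRPO
    "-kl", "0.01",                       # --kl-coeff for GRPO
    "-mnt", "4096",                      # --max-new-tokens for generation
    "-tbs", "8",                         # --train-batch-size
    "-tmbs", "2",                        # --train-microbatch-size
]
subprocess.run(cmd, check=True)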

Template

A few example configurations are provided to help users quickly start training. The expression in the "Max new tokens" column is explained below the table.

Model size | GPU type | Num GPUs | Policy | Max new tokens | Max prompt tokens | Train microbatch size | Max CUDA memory (Terminal)
--- | --- | --- | --- | --- | --- | --- | ---
3B | H20: 98G | 4 | Producer 2 Consumer 2 Zero2 | 1024 * 4 - 512 | 512 | 2 | ~55G
3B | H20: 98G | 4 | Producer 2 Consumer 2 Zero2 | 1024 * 4 - 512 | 512 | 4 | ~72G
3B | H20: 98G | 4 | Producer 2 Consumer 2 Zero2 | 1024 * 8 - 512 | 512 | 2 | ~72G
3B | H200: 140G | 8 | Producer 4 Consumer 4 Zero2 | 1024 * 8 - 512 | 512 | 2 | ~50G
3B | H200: 140G | 8 | Producer 4 Consumer 4 Zero2 | 1024 * 8 - 512 | 512 | 4 | ~60G
3B | H200: 140G | 8 | Producer 4 Consumer 4 Zero2 | 1024 * 8 - 512 | 512 | 8 | ~100G
7B | H200: 140G | 8 | Producer 4 Consumer 4 Zero2 | 1024 * 8 - 512 | 512 | 2 | ~85G
7B | H200: 140G | 8 | Producer 4 Consumer 4 Zero2 | 1024 * 8 - 512 | 512 | 4 | ~100G
7B | H200: 140G | 8 | Producer 4 Consumer 4 Zero2 | 1024 * 8 - 512 | 512 | 8 | ~138G
14B | H200: 140G | 8 | Producer 4 Consumer 4 Zero2 | 1024 * 8 - 512 | 512 | 2 | ~140G
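
The "Max new tokens" column is an arithmetic expression. A small sketch of how to read it for the 8-GPU rows is shown below; the interpretation that the prompt budget and generation budget together fill the sequence window is an assumption based on the numbers in the table, not a statement from the Colossal-AI documentation.

# Reading the "Max new tokens" expression for the 8-GPU H200 rows.
max_prompt_tokens = 512                 # -mpt value from the table
max_new_tokens = 1024 * 8 - 512         # -mnt value: 7680 new tokens
total_sequence_length = max_prompt_tokens + max_new_tokens   # 8192 tokens in total (assumed window)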