New: RUNRL JOB Is Live on HPC-AI.COM
Reinforcement fine-tuning (RFT) is powerful, but let's face it: it used to be a pain to run. Dual networks, huge memory needs, tons of config files...
That's why we built RUNRL JOB: the easiest way to run RFT workloads like GRPO directly on HPC-AI.COM. No complicated setup. Just pick your model, launch your job, and go.
Watch the demo video:
Explore more about RFT (Reinforcement Fine-Tuning) jobs with our step-by-step tutorial.
🧠 Why Everyone's Talking About RFT (and GRPO)
Reinforcement Fine-Tuning (RFT) has become the go-to method for aligning language models, but many popular approaches like PPO are resource-intensive. Enter GRPO (Group Relative Policy Optimization): a lightweight, critic-free alternative that's stable, fast, and efficient. No separate value network, no double backward pass, and a much smaller memory footprint.
GRPO maintains the trust-region stability of PPO while cutting memory usage by over 40%, making it ideal for LLM reasoning, code generation, and complex math tasks. It's a smarter way to fine-tune, perfect if you want to reduce costs and speed up your workflow without sacrificing quality.
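To make the critic-free idea concrete, here is a minimal sketch of how GRPO estimates advantages: it samples a group of responses per prompt and standardizes each response's reward against the group's own mean and standard deviation, so no learned value network is needed. This is an illustrative sketch, not the RUNRL implementation, and the names are ours:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages for one prompt.

    rewards: shape (G,), one scalar reward per sampled response.
    Each advantage is the reward standardized against the group's own
    statistics -- no critic / value network is ever trained.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of 4 sampled answers scored with the tiered scheme below.
rewards = torch.tensor([0.0, 1.0, 10.0, 1.0])
print(group_relative_advantages(rewards))  # the 10-reward answer gets the large positive advantage
```

Because the baseline comes from the group itself, the second network and the extra backward pass that PPO's critic requires simply disappear, which is where the memory savings come from.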
💡 GRPO on Qwen2.5-3B: Our Results
We've put the GRPO algorithm from DeepSeek to the test using the Qwen2.5-3B-Base model, and the results are exciting.
Reward Function Design (a code sketch follows the list):
- Reward = 0 if the format is incorrect.
- Reward = 1 if the format is correct but the result is incorrect.
- Reward = 10 if both the format and result are correct.
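As a concrete illustration, here is a hedged sketch of that tiered reward in Python. It assumes a DeepSeek-R1-style `<think>...</think><answer>...</answer>` response format and simple exact-match answer checking; the released template's actual format check and answer parser may differ.

```python
import re

# Assumed response layout: reasoning inside <think>, final answer inside <answer>.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def reward_fn(response: str, ground_truth: str) -> float:
    """Tiered reward: 0 (bad format), 1 (format ok, wrong answer), 10 (both ok)."""
    match = FORMAT_PATTERN.match(response.strip())
    if match is None:
        return 0.0                       # format is incorrect
    predicted = match.group(1).strip()   # pull out the final answer
    if predicted == ground_truth.strip():
        return 10.0                      # format and result both correct
    return 1.0                           # format correct, result incorrect

print(reward_fn("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # 10.0
print(reward_fn("<think>guess</think><answer>5</answer>", "4"))      # 1.0
print(reward_fn("the answer is 4", "4"))                             # 0.0
```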
We provide a conversation template and settings to verify GRPO, demonstrated on the Qwen2.5-3B-Base model. You can find the template here: Qwen2.5-3B Conversation Template
Ready to start training? Just run this bash script: GRPO Training Script
In the GRPO section, we also share insights from the verification process and detailed explanations of the key parameters for your reference.
The code features a flexible reward function template, so you can customize the scoring system to fit your specific needs.
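For example, swapping in a custom scoring rule can be as small as wrapping the sketch above. The variant below (our illustration, not the template's actual API) adds partial credit for numeric answers that land within roughly 1% of the target:

```python
def custom_reward_fn(response: str, ground_truth: str) -> float:
    """Tiered reward plus partial credit for near-miss numeric answers."""
    base = reward_fn(response, ground_truth)
    if base != 1.0:          # keep 0 (bad format) and 10 (exact match) as-is
        return base
    predicted = FORMAT_PATTERN.match(response.strip()).group(1)
    try:
        target = float(ground_truth)
        if abs(float(predicted) - target) <= 0.01 * max(abs(target), 1.0):
            return 5.0       # close enough to the target: partial credit
    except ValueError:
        pass                 # non-numeric answers keep the base reward
    return base
```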
Even on a 3B model, we observe steady improvements in reward scores and response length, clear evidence that GRPO effectively boosts reasoning and output quality.
⚙️ Run GRPO in One Click on HPC-AI.COM
Want to try it yourself? Now it's easier than ever.
We've made RFT completely plug-and-play on HPC-AI.COM. With the RUNRL JOB feature, you can launch full reinforcement learning fine-tuning workflows like GRPO with zero setup and maximum flexibility.
Please refer to GRPO vs Other RL Algorithms: A Simple, Clear Guide for more details.
No friction. No boilerplate. Just launch, fine-tune, and go.