New: RUNRL JOB Is Live on HPC-AI.COM
Reinforcement fine-tuning (RFT) is powerful, but let's face it: it used to be a pain to run. Dual networks, huge memory needs, tons of config files...
That's why we built RUNRL JOB: the easiest way to run RFT workloads like GRPO directly on HPC-AI.COM. No complicated setup. Just pick your model, launch your job, and go.
Watch the demo video:
Explore more about RFT (Reinforcement Fine-Tuning) jobs with our step-by-step tutorial.
🧠 Why Everyone's Talking About RFT (and GRPO)
Reinforcement Fine-Tuning (RFT) has become the go-to method for aligning language models, but many popular approaches like PPO are resource-intensive. Enter GRPO (Group Relative Policy Optimization): a lightweight, critic-free alternative that's stable, fast, and efficient. No separate value network, no double backward pass, and a much smaller memory footprint.
GRPO maintains the trust-region stability of PPO while cutting memory usage by over 40%, making it ideal for LLM reasoning, code generation, and complex math tasks. It's a smarter way to fine-tune, perfect if you want to reduce costs and speed up your workflow without sacrificing quality.
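To make the critic-free idea concrete, here is a minimal sketch of how GRPO estimates advantages: it samples a group of responses per prompt and standardizes each response's reward against the group's own mean and standard deviation, so no learned value network is needed. This is an illustrative sketch, not the RUNRL implementation, and the names are ours:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages for one prompt.

    rewards: shape (G,), one scalar reward per sampled response.
    Each advantage is the reward standardized against the group's own
    statistics -- no critic / value network is ever trained.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of 4 sampled answers scored with the tiered scheme below.
rewards = torch.tensor([0.0, 1.0, 10.0, 1.0])
print(group_relative_advantages(rewards))  # the 10-reward answer gets the large positive advantage
```

Because the baseline comes from the group itself, the second network and the extra backward pass that PPO's critic requires simply disappear, which is where the memory savings come from.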
💡 GRPO on Qwen2.5-3B: Our Results
We've put the GRPO algorithm from DeepSeek to the test using the Qwen2.5-3B-Base model, and the results are exciting.
Reward Function Design (a code sketch follows the list):
- Reward = 0 if the format is incorrect.
- Reward = 1 if the format is correct but the result is incorrect.
- Reward = 10 if both the format and result are correct.
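As a concrete illustration, here is a hedged sketch of that tiered reward in Python. It assumes a DeepSeek-R1-style `<think>...</think><answer>...</answer>` response format and simple exact-match answer checking; the released template's actual format check and answer parser may differ.

```python
import re

# Assumed response layout: reasoning inside <think>, final answer inside <answer>.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def reward_fn(response: str, ground_truth: str) -> float:
    """Tiered reward: 0 (bad format), 1 (format ok, wrong answer), 10 (both ok)."""
    match = FORMAT_PATTERN.match(response.strip())
    if match is None:
        return 0.0                       # format is incorrect
    predicted = match.group(1).strip()   # pull out the final answer
    if predicted == ground_truth.strip():
        return 10.0                      # format and result both correct
    return 1.0                           # format correct, result incorrect

print(reward_fn("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # 10.0
print(reward_fn("<think>guess</think><answer>5</answer>", "4"))      # 1.0
print(reward_fn("the answer is 4", "4"))                             # 0.0
```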
We provide a conversation template and settings to verify GRPO, demonstrated on the Qwen2.5-3B-Base model. You can find the template here: Qwen2.5-3B Conversation Template
Ready to start training? Just run this bash script: GRPO Training Script
In the GRPO section, we also share insights from the verification process and detailed explanations of the key parameters for your reference.
The code features a flexible reward function template, so you can customize the scoring system to fit your specific needs.
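For example, swapping in a custom scoring rule can be as small as wrapping the sketch above. The variant below (our illustration, not the template's actual API) adds partial credit for numeric answers that land within roughly 1% of the target:

```python
def custom_reward_fn(response: str, ground_truth: str) -> float:
    """Tiered reward plus partial credit for near-miss numeric answers."""
    base = reward_fn(response, ground_truth)
    if base != 1.0:          # keep 0 (bad format) and 10 (exact match) as-is
        return base
    predicted = FORMAT_PATTERN.match(response.strip()).group(1)
    try:
        target = float(ground_truth)
        if abs(float(predicted) - target) <= 0.01 * max(abs(target), 1.0):
            return 5.0       # close enough to the target: partial credit
    except ValueError:
        pass                 # non-numeric answers keep the base reward
    return base
```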
Even on a 3B model, we observe steady improvements in reward scores and response length, clear evidence that GRPO effectively boosts reasoning and output quality.
⚙️ Run GRPO in One Click on HPC-AI.COM
Want to try it yourself? Now it's easier than ever.
We've made RFT completely plug-and-play on HPC-AI.COM. With the RUNRL JOB feature, you can launch full reinforcement learning fine-tuning workflows like GRPO with zero setup and maximum flexibility.
Please refer to GRPO vs Other RL Algorithms: A Simple, Clear Guide for more details.
No friction. No boilerplate. Just launch, fine-tune, and go.