SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
arXiv cs.LG · April 28, 2026
Key Points
- A new arXiv study argues that the reported gains of methods that interleave supervised fine-tuning and reinforcement learning (mixed-policy methods) for LLM reasoning are largely due to flawed baselines.
- The authors identify two bugs that depress SFT performance: one in DeepSpeed, where the CPU-offloaded optimizer drops intermediate micro-batches during gradient accumulation, and one in OpenRLHF, where the loss is weighted incorrectly across mini-batches (a corrected-behavior sketch appears after this list).
- After fixing these issues, the standard SFT-then-RL pipeline outperforms all evaluated mixed-policy methods, improving math benchmark scores by +3.8 points on Qwen2.5-Math-7B and by +22.2 points on Llama-3.1-8B.
- The study also finds that a reduced setup with only 50 RL steps can beat mixed-policy methods on math benchmarks while using fewer FLOPs.
- The results suggest that some recent mixed-policy conclusions may need re-evaluation, since multiple downstream training frameworks inherit the underlying bugs.
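
The paper's point is that both bugs distort how much each training example actually contributes to the SFT update. The snippet below is a minimal sketch of the behavior a fixed pipeline should exhibit, assuming a standard PyTorch / Hugging Face-style training loop; `model`, `optimizer`, and `micro_batches` are hypothetical placeholders, and this is not code from DeepSpeed, OpenRLHF, or the paper itself.

```python
# Minimal sketch (assumptions noted above): gradient accumulation in which
# every micro-batch contributes to the update, and losses are weighted by
# token count over the whole accumulation window so the result matches
# full-batch training.
import torch
import torch.nn.functional as F

def sft_accumulation_step(model, optimizer, micro_batches):
    # Total number of supervised tokens across all micro-batches in this step;
    # used as a single global normalizer so no micro-batch is over- or
    # under-weighted. Labels are assumed already aligned with the logits.
    total_tokens = sum(int(mb["labels"].ne(-100).sum()) for mb in micro_batches)

    optimizer.zero_grad()
    for mb in micro_batches:
        logits = model(mb["input_ids"]).logits
        # Sum (not mean) the token-level loss so the weighting is explicit.
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            mb["labels"].reshape(-1),
            ignore_index=-100,
            reduction="sum",
        )
        # Scale by the global token count: each micro-batch keeps its
        # proportional share of the gradient, and none are silently dropped.
        (loss / total_tokens).backward()

    # One optimizer step after all micro-batches have accumulated gradients.
    optimizer.step()
```

Normalizing by a single token count computed over the whole accumulation window, rather than averaging each micro-batch independently, is one straightforward way to keep the effective loss identical to what a single large batch would produce.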