SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

arXiv cs.LG / 4/28/2026

Key Points

  • A new arXiv study argues that reported gains from mixed supervised-and-reinforcement learning (mixed-policy) methods for LLM reasoning are largely due to flawed baselines.
  • The authors identify two bugs that depress SFT performance: a CPU-offloaded optimizer in DeepSpeed that drops intermediate micro-batches during gradient accumulation, and a loss aggregation step in OpenRLHF that weights per-mini-batch losses incorrectly (a sketch of the gradient-accumulation failure mode follows this list).
  • After fixing these issues, the standard SFT-then-RL pipeline outperforms all evaluated mixed-policy methods, improving math benchmark scores by +3.8 points on Qwen2.5-Math-7B and by +22.2 points on Llama-3.1-8B.
  • The study also finds that a reduced setup with only 50 RL steps can beat mixed-policy methods on math benchmarks while using fewer FLOPs.
  • The results imply that conclusions from some recent mixed-policy studies may need re-evaluation, since the underlying bugs propagate to multiple downstream training frameworks.
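
The DeepSpeed issue concerns gradient accumulation: each optimizer step is supposed to use gradients summed over all micro-batches, so silently dropping the intermediate ones changes which data the update actually reflects. The sketch below is plain PyTorch, not DeepSpeed's code, and the function and variable names are illustrative; it shows what correct accumulation looks like so that the reported failure mode (an update effectively computed from only part of the accumulation window) is easy to picture.

```python
import torch

def accumulate_and_step(model, optimizer, loss_fn, micro_batches):
    """Correct gradient accumulation: one optimizer step per group of micro-batches.

    Illustrative sketch, not DeepSpeed's implementation. The bug described in the
    paper amounts to the optimizer update not reflecting the intermediate
    micro-batches, rather than the full sum built up below.
    """
    optimizer.zero_grad()
    num_micro = len(micro_batches)
    for inputs, targets in micro_batches:
        # Scale each micro-batch loss so the accumulated gradient is the mean over micro-batches.
        loss = loss_fn(model(inputs), targets) / num_micro
        loss.backward()  # gradients from every micro-batch accumulate in .grad
    optimizer.step()     # a single update from the full accumulated gradient
```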

Abstract

Recent mixed-policy optimization methods for LLM reasoning that interleave or blend supervised and reinforcement learning signals report improvements over the standard SFT-then-RL pipeline. We show that numerous recently published research papers rely on a faulty baseline caused by two distinct bugs: a CPU-offloaded optimizer bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation (affecting multiple downstream frameworks including TRL, OpenRLHF and Llama-Factory), and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap and the loss aggregation bug contributing a smaller additional effect. Once corrected, the standard SFT-then-RL pipeline surpasses every published mixed-policy method we evaluate by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with Llama-3.1-8B. Even a truncated variant with just 50 RL steps outperforms mixed-policy methods on math benchmarks while using fewer FLOPs.
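
The OpenRLHF issue concerns how per-mini-batch losses are combined. A minimal sketch, assuming the usual form of this kind of bug: averaging each mini-batch's mean loss gives every mini-batch equal weight regardless of how many tokens it contains, whereas the intended aggregation weights every token equally. The function names and toy numbers below are hypothetical, not taken from the paper or from OpenRLHF.

```python
import torch

def mean_of_means(losses, token_counts):
    # Naive aggregation: each mini-batch's mean loss counts equally,
    # no matter how many tokens the mini-batch contains.
    per_batch_means = [l.sum() / n for l, n in zip(losses, token_counts)]
    return torch.stack(per_batch_means).mean()

def token_weighted_mean(losses, token_counts):
    # Intended aggregation: every token contributes equally to the overall loss.
    total_loss = torch.stack([l.sum() for l in losses]).sum()
    return total_loss / sum(token_counts)

# Toy example: two mini-batches holding 2 and 8 per-token losses.
losses = [torch.tensor([1.0, 1.0]), torch.tensor([0.0] * 8)]
token_counts = [2, 8]
print(mean_of_means(losses, token_counts))        # tensor(0.5000)
print(token_weighted_mean(losses, token_counts))  # tensor(0.2000)
```

With unequal mini-batch sizes the two quantities diverge, so a trainer that uses the wrong one effectively over-weights short mini-batches during SFT.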