SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

arXiv cs.LG / 4/28/2026

Key Points

  • A new arXiv study argues that reported gains from mixed supervised-and-reinforcement learning (mixed-policy) methods for LLM reasoning are largely due to flawed baselines.
  • The authors identify two bugs that depress SFT performance: a CPU-offloaded optimizer in DeepSpeed that drops intermediate micro-batches during gradient accumulation, and a loss aggregation step in OpenRLHF that weights per-mini-batch losses incorrectly (a sketch of the gradient-accumulation failure mode follows this list).
  • After fixing these issues, the standard SFT-then-RL pipeline outperforms all evaluated mixed-policy methods, improving math benchmark scores by +3.8 points on Qwen2.5-Math-7B and by +22.2 points on Llama-3.1-8B.
  • The study also finds that a reduced setup with only 50 RL steps can beat mixed-policy methods on math benchmarks while using fewer FLOPs.
  • The results imply that conclusions from some recent mixed-policy studies may need re-evaluation, since the underlying bugs propagate to multiple downstream training frameworks.
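
The DeepSpeed issue concerns gradient accumulation: each optimizer step is supposed to use gradients summed over all micro-batches, so silently dropping the intermediate ones changes which data the update actually reflects. The sketch below is plain PyTorch, not DeepSpeed's code, and the function and variable names are illustrative; it shows what correct accumulation looks like so that the reported failure mode (an update effectively computed from only part of the accumulation window) is easy to picture.

```python
import torch

def accumulate_and_step(model, optimizer, loss_fn, micro_batches):
    """Correct gradient accumulation: one optimizer step per group of micro-batches.

    Illustrative sketch, not DeepSpeed's implementation. The bug described in the
    paper amounts to the optimizer update not reflecting the intermediate
    micro-batches, rather than the full sum built up below.
    """
    optimizer.zero_grad()
    num_micro = len(micro_batches)
    for inputs, targets in micro_batches:
        # Scale each micro-batch loss so the accumulated gradient is the mean over micro-batches.
        loss = loss_fn(model(inputs), targets) / num_micro
        loss.backward()  # gradients from every micro-batch accumulate in .grad
    optimizer.step()     # a single update from the full accumulated gradient
```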

Abstract

Recent mixed-policy optimization methods for LLM reasoning that interleave or blend supervised and reinforcement learning signals report improvements over the standard SFT-then-RL pipeline. We show that numerous recently published research papers rely on a faulty baseline caused by two distinct bugs: a CPU-offloaded optimizer bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation (affecting multiple downstream frameworks including TRL, OpenRLHF and Llama-Factory), and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap and the loss aggregation bug contributing a smaller additional effect. Once corrected, the standard SFT-then-RL pipeline surpasses every published mixed-policy method we evaluate by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with Llama-3.1-8B. Even a truncated variant with just 50 RL steps outperforms mixed-policy methods on math benchmarks while using fewer FLOPs.
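
The OpenRLHF issue concerns how per-mini-batch losses are combined. A minimal sketch, assuming the usual form of this kind of bug: averaging each mini-batch's mean loss gives every mini-batch equal weight regardless of how many tokens it contains, whereas the intended aggregation weights every token equally. The function names and toy numbers below are hypothetical, not taken from the paper or from OpenRLHF.

```python
import torch

def mean_of_means(losses, token_counts):
    # Naive aggregation: each mini-batch's mean loss counts equally,
    # no matter how many tokens the mini-batch contains.
    per_batch_means = [l.sum() / n for l, n in zip(losses, token_counts)]
    return torch.stack(per_batch_means).mean()

def token_weighted_mean(losses, token_counts):
    # Intended aggregation: every token contributes equally to the overall loss.
    total_loss = torch.stack([l.sum() for l in losses]).sum()
    return total_loss / sum(token_counts)

# Toy example: two mini-batches holding 2 and 8 per-token losses.
losses = [torch.tensor([1.0, 1.0]), torch.tensor([0.0] * 8)]
token_counts = [2, 8]
print(mean_of_means(losses, token_counts))        # tensor(0.5000)
print(token_weighted_mean(losses, token_counts))  # tensor(0.2000)
```

With unequal mini-batch sizes the two quantities diverge, so a trainer that uses the wrong one effectively over-weights short mini-batches during SFT.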