Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
arXiv cs.LG / 4/15/2026
Key Points
- The paper explores making on-policy distillation (OPD) offline by precomputing the teacher's log-probabilities on SFT rollouts once, instead of running a live teacher inference server throughout training (see the training-step sketch after this list).
- It finds a critical requirement the authors call “teacher consistency”: the same teacher model must be used for both SFT and OPD. Violating it introduces an irreducible gradient bias that drives training to a suboptimal fixed point (see the gradient sketch below).
- Based on this, the authors propose “Lightning OPD,” an offline OPD framework that enforces teacher consistency while completely removing the need for a live teacher server.
- Experiments on mathematical reasoning and code generation show Lightning OPD achieves state-of-the-art results with bounded gradient discrepancy and implicit regularization that helps prevent policy drift.
- Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in 30 GPU hours, a reported 4.0x speedup over standard OPD that lowers the barrier to academic post-training research.
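
To make the first key point concrete, here is a minimal PyTorch sketch of what one offline OPD training step could look like under that description. It is an illustration, not the paper's implementation: the batch layout, the field names (`input_ids`, `teacher_logps`, `loss_mask`), the HF-style `student(...).logits` call, and the reverse-KL-style surrogate are all assumptions.

```python
import torch

def offline_opd_step(student, batch, optimizer):
    """One offline on-policy distillation step (illustrative sketch).

    The teacher never runs at training time: its per-token log-probs were
    computed once, offline, on cached SFT rollouts. Assumed batch layout:

      batch["input_ids"]     (B, T): precomputed SFT rollout tokens
      batch["teacher_logps"] (B, T): cached teacher log-probs per token
      batch["loss_mask"]     (B, T): 1 on response tokens, 0 elsewhere
    """
    input_ids = batch["input_ids"]
    logits = student(input_ids).logits                 # (B, T, V), HF-style API assumed
    logps = torch.log_softmax(logits[:, :-1], dim=-1)  # predict token t+1 from prefix
    next_ids = input_ids[:, 1:].unsqueeze(-1)
    student_logps = logps.gather(-1, next_ids).squeeze(-1)  # (B, T-1)

    teacher_logps = batch["teacher_logps"][:, 1:]      # align with predicted tokens
    mask = batch["loss_mask"][:, 1:].float()

    # Per-token reverse-KL estimate on the sampled tokens: positive where
    # the student puts more mass on a token than the cached teacher did.
    kl_hat = (student_logps - teacher_logps).detach()

    # REINFORCE-style surrogate: its gradient pushes student token
    # log-probs toward the teacher's. Because rollouts are fixed SFT
    # samples rather than fresh student samples, this estimator carries
    # exactly the kind of bias the "teacher consistency" condition is
    # meant to control.
    loss = (kl_hat * student_logps * mask).sum() / mask.sum().clamp(min=1.0)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the sketch is the data flow: the teacher appears only as a cached tensor read from disk, so no teacher inference server is needed during training.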
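For the second and fourth key points, one plausible way to read the “gradient bias” and “bounded gradient discrepancy” claims, in my own notation rather than the paper's: the on-policy gradient samples from the current student, while the offline estimator samples from the frozen SFT policy whose rollouts were cached.

```latex
% Notation is mine, not the paper's: \pi_\theta is the student, \pi_T the
% teacher, and \pi_{\mathrm{sft}} the fixed SFT policy whose rollouts and
% teacher log-probs were cached offline.
\[
g_{\mathrm{on}}(\theta) = \mathbb{E}_{y \sim \pi_\theta}\!\left[
  \bigl(\log \pi_\theta(y) - \log \pi_T(y)\bigr)\,
  \nabla_\theta \log \pi_\theta(y) \right],
\qquad
g_{\mathrm{off}}(\theta) = \mathbb{E}_{y \sim \pi_{\mathrm{sft}}}\!\left[
  \bigl(\log \pi_\theta(y) - \log \pi_T(y)\bigr)\,
  \nabla_\theta \log \pi_\theta(y) \right].
\]
```

The discrepancy between the two grows with the gap between the current student and the frozen SFT policy. If the SFT stage distilled the same teacher, the student starts near the teacher and never needs to drift far from the rollout distribution, so the discrepancy stays bounded; if SFT used a different teacher, the offline objective's fixed point is displaced from the true one and the bias cannot be trained away. This is consistent with the “implicit regularization” framing in the fourth key point, though the paper's exact analysis may differ.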