Scaling Reasoning Efficiently via Relaxed On-Policy Distillation
arXiv cs.LG / 3/13/2026
Key Points
- The paper introduces REOPOLD (Relaxed On-Policy Distillation), which reframes on-policy distillation as policy optimization by using the teacher-student log-likelihood ratio as a per-token reward (first sketch below).
- It stabilizes optimization by relaxing strict imitation constraints through mixture-based reward clipping and entropy-based token-level dynamic sampling (second sketch below).
- A unified exploration-to-refinement strategy schedules training from broad exploration early on to focused refinement later (third sketch below).
- Empirically, REOPOLD delivers 6.7–12x higher sample efficiency and up to 3.32x faster inference, enabling a 7B student to match a 32B teacher on visual-reasoning tasks spanning math, vision, and tool use.
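To ground the reward formulation, the sketch below computes the per-token reward as the log-likelihood ratio between teacher and student, evaluated on tokens the student itself sampled (the on-policy rollout). The function name, tensor shapes, and PyTorch framing are illustrative assumptions, not the paper's actual API.

```python
import torch
import torch.nn.functional as F

def token_rewards(teacher_logits: torch.Tensor,
                  student_logits: torch.Tensor,
                  tokens: torch.Tensor) -> torch.Tensor:
    """Per-token reward r_t = log p_teacher(y_t | y_<t) - log p_student(y_t | y_<t).

    teacher_logits, student_logits: (seq_len, vocab_size) scores from each
    model on the same student-generated rollout; tokens: (seq_len,) token ids.
    """
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    idx = tokens.unsqueeze(-1)
    # Positive reward where the teacher is more confident in the sampled
    # token than the student; negative where the student overshoots.
    ratio = teacher_logp.gather(-1, idx) - student_logp.gather(-1, idx)
    return ratio.squeeze(-1)
```

Treating this ratio as a token reward is what lets any token-level policy-gradient objective optimize it, which is the sense in which distillation becomes policy optimization.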
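The two stabilizers in the second bullet are sketched below under explicit assumptions: "mixture-based reward clipping" is read here as scoring the sampled token against a teacher-student mixture (1 − α)·p_student + α·p_teacher, which bounds the reward from below by log(1 − α); "entropy-based token-level dynamic sampling" is read as masking updates to tokens where the student's predictive entropy is still high. Both readings, and the parameters `alpha` and `entropy_quantile`, are assumptions rather than the paper's definitions.

```python
import torch
import torch.nn.functional as F

def relaxed_rewards_and_mask(teacher_logits: torch.Tensor,
                             student_logits: torch.Tensor,
                             tokens: torch.Tensor,
                             alpha: float = 0.5,
                             entropy_quantile: float = 0.5):
    teacher_p = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    student_p = student_logp.exp()
    idx = tokens.unsqueeze(-1)
    # Score against the mixture instead of the raw teacher: the reward
    # log(mix / p_student) can never drop below log(1 - alpha), so one
    # far-off-teacher token cannot blow up the update.
    mix_p = (1.0 - alpha) * student_p + alpha * teacher_p
    rewards = mix_p.gather(-1, idx).log() - student_logp.gather(-1, idx)
    # Token-level dynamic sampling: keep only the tokens where the student
    # is still uncertain (entropy at or above the batch quantile).
    entropy = -(student_p * student_logp).sum(dim=-1)
    mask = entropy >= torch.quantile(entropy, entropy_quantile)
    return rewards.squeeze(-1), mask
```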
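Finally, a minimal sketch of what a unified exploration-to-refinement strategy could look like, assuming it reduces to annealing a single control knob (for example the sampling temperature of the student's rollouts) from an exploratory to a refinement regime over training. The linear schedule and both endpoint values are illustrative assumptions.

```python
def exploration_to_refinement(step: int, total_steps: int,
                              explore: float = 1.0, refine: float = 0.1) -> float:
    """Linearly anneal a control value (e.g. rollout temperature) from an
    exploratory setting down to a refinement setting over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return explore + frac * (refine - explore)

# Early training samples broadly; late training sharpens toward the teacher.
temperature = exploration_to_refinement(step=100, total_steps=1000)
```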