Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
arXiv cs.LG / 4/15/2026
Key Points
- The paper systematically studies on-policy distillation (OPD) and pinpoints two key conditions for success: compatible “thinking patterns” between teacher and student, and the teacher providing genuinely novel capabilities beyond what the student has already encountered in training.
- In weak-to-strong reverse-distillation experiments, it finds that same-family 1.5B and 7B teachers can be distributionally indistinguishable from the student’s perspective, implying little incremental benefit when the teacher brings no genuinely new capability.
- Token-level probing shows that successful OPD emerges from progressive alignment on high-probability tokens specifically at student-visited states, with a small shared token set accounting for most of the probability mass (97%–99%); see the probing sketch after this list.
- The authors propose two recovery strategies for failing OPD, an off-policy cold start and teacher-aligned prompt selection, to restore training effectiveness; a cold-start sketch appears after this list.
- While OPD appears to offer a “free lunch” via a dense token-level reward, the study argues this signal comes with costs and raises open questions about scaling OPD to long-horizon distillation; a minimal sketch of the per-token objective follows this list.
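To make the “dense token-level reward” concrete, here is a minimal sketch of an on-policy distillation step, assuming Hugging Face-style causal LMs: the student samples its own continuation, both models re-score it, and the loss is a per-token reverse KL at exactly the states the student visits. Function names, the sampling settings, and the training interface are illustrative assumptions, not the paper’s actual code.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student, teacher, prompt_ids, max_new_tokens=128):
    # 1) Roll out the student on its own policy (the student-visited states).
    with torch.no_grad():
        rollout = student.generate(
            prompt_ids, max_new_tokens=max_new_tokens, do_sample=True
        )
    prompt_len = prompt_ids.shape[1]

    # 2) Re-score the full rollout with both models.
    student_logits = student(rollout).logits           # (B, T, V)
    with torch.no_grad():
        teacher_logits = teacher(rollout).logits       # (B, T, V)

    # 3) Per-token reverse KL, KL(student || teacher), on generated positions
    #    only; logits at position t predict the token at position t + 1.
    gen = slice(prompt_len - 1, rollout.shape[1] - 1)
    s_logp = F.log_softmax(student_logits[:, gen], dim=-1)
    t_logp = F.log_softmax(teacher_logits[:, gen], dim=-1)
    per_token_kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)  # (B, T_gen)

    # One KL term per student-visited state: the dense token-level signal.
    return per_token_kl.mean()
```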
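The probing sketch below follows the spirit of the paper’s token-level analysis: at a student-visited state, take each model’s top-k next-token candidates, intersect them, and measure how much probability mass each distribution places on the shared set (the paper reports 97%–99%). The function name and the choice of k are illustrative assumptions.

```python
import torch

def shared_topk_mass(student_logits, teacher_logits, k=20):
    """Probability mass each model assigns to the shared top-k token set.

    student_logits, teacher_logits: (V,) next-token logits at one state.
    """
    s_probs = torch.softmax(student_logits, dim=-1)
    t_probs = torch.softmax(teacher_logits, dim=-1)

    s_top = set(torch.topk(s_probs, k).indices.tolist())
    t_top = set(torch.topk(t_probs, k).indices.tolist())
    shared = torch.tensor(sorted(s_top & t_top))

    if shared.numel() == 0:
        return 0.0, 0.0
    return s_probs[shared].sum().item(), t_probs[shared].sum().item()
```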
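Finally, a hedged sketch of the off-policy cold start: a short supervised phase on teacher-generated completions before switching to OPD, so the student’s distribution moves close enough to the teacher’s for the on-policy signal to bite. The helper below assumes the same Hugging Face-style interface as above; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def cold_start_step(student, teacher, prompt_ids, optimizer,
                    max_new_tokens=128):
    # Teacher generates the target continuation (off-policy data).
    with torch.no_grad():
        full = teacher.generate(
            prompt_ids, max_new_tokens=max_new_tokens, do_sample=True
        )
    prompt_len = prompt_ids.shape[1]

    # Standard next-token cross-entropy on the teacher's tokens; logits at
    # position t predict the token at position t + 1.
    logits = student(full).logits[:, prompt_len - 1 : -1]
    targets = full[:, prompt_len:]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```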