Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

arXiv cs.LG / 4/15/2026


Key Points

  • The paper systematically studies on-policy distillation (OPD) and pinpoints two key conditions for success: compatible “thinking patterns” between teacher and student, and the teacher providing genuinely novel capabilities beyond what the student has already encountered in training.
  • Using weak-to-strong reverse distillation, it finds that same-family 1.5B and 7B teachers can be distributionally indistinguishable from the student’s perspective, suggesting limited incremental benefit without true novelty.
  • Token-level probing shows that successful OPD emerges from progressive alignment on high-probability tokens at student-visited states, with a small shared token set accounting for most of the probability mass (97%-99%).
  • The authors propose two recovery strategies for failing OPD—off-policy cold start and teacher-aligned prompt selection—to help regain training effectiveness.
  • While OPD appears to offer a “free lunch” via dense token-level reward, the study argues this comes with costs and raises open questions about OPD scaling for long-horizon distillation.
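
The dense token-level reward mentioned above can be made concrete with a minimal sketch: the student generates a trajectory, and at each student-visited state the teacher's distribution supplies a per-token signal, here the reverse KL from student to teacher. This is an illustrative toy in numpy, not the paper's implementation; the function names and shapes are assumptions.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def per_token_reverse_kl(student_logits, teacher_logits):
    """Dense token-level signal: reverse KL(student || teacher) at each
    student-visited state (one row per generated token). Shape: (T,)."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return (p_s * (np.log(p_s) - np.log(p_t))).sum(axis=-1)

# Toy example: a 4-token student rollout over a 5-token vocabulary,
# with a teacher whose logits are only slightly perturbed from the student's.
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 5))
teacher = student + 0.1 * rng.normal(size=(4, 5))
kl = per_token_reverse_kl(student, teacher)  # one reward-like value per token
```

Because the divergence is evaluated only at states the student actually reaches, the signal is on-policy: a teacher that is distributionally indistinguishable at those states (as in the weak-to-strong reverse setting above) yields near-zero per-token KL and hence little to learn from.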

Abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify two conditions that govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, where a small shared token set concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
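
The abstract's 97%-99% observation is a statement about where probability mass sits: at a student-visited state, a small set of tokens ranked highly by both models carries almost all of each model's mass. A simple probe of this kind can be sketched as follows; the exact protocol in the paper may differ, and `k` and the intersection rule here are assumptions.

```python
import numpy as np

def shared_topk_mass(p_student, p_teacher, k=2):
    """Probability mass each model places on the tokens that appear in BOTH
    models' top-k sets at a given state. Returns (student_mass, teacher_mass)."""
    top_s = set(np.argsort(p_student)[-k:])   # student's top-k token ids
    top_t = set(np.argsort(p_teacher)[-k:])   # teacher's top-k token ids
    shared = list(top_s & top_t)              # tokens both models rank highly
    return p_student[shared].sum(), p_teacher[shared].sum()

# Toy distributions over a 5-token vocabulary: both models concentrate
# mass on the same two tokens, so the shared top-2 set carries ~90% each.
p_s = np.array([0.70, 0.20, 0.05, 0.03, 0.02])
p_t = np.array([0.60, 0.30, 0.04, 0.04, 0.02])
m_s, m_t = shared_topk_mass(p_s, p_t, k=2)
```

Tracking these two masses over training would make the "progressive alignment on high-probability tokens" claim directly measurable: successful runs should see both values climb toward the 97%-99% regime at student-visited states.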