Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

arXiv cs.LG / 3/27/2026


Key Points

  • On-policy distillation (OPD) for LLM post-training is attractive because it scores teacher feedback on student rollouts rather than fixed teacher traces, but the common sampled-token variant becomes fragile in long-horizon settings as rollouts drift from teacher-typical prefixes.
  • The paper analyzes OPD from the estimator and implementation sides: token-level OPD is biased relative to sequence-level reverse-KL but admits a much tighter worst-case variance bound, and experiments show that stronger future-reward coupling raises gradient variance and destabilizes learning.
  • It identifies three concrete failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions from tokenizer/special-token mismatches.
  • The authors propose a simple fix, teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking; it improves optimization stability and downstream performance across single-task math and agentic multi-task settings.
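The top-K support-matching objective from the key points can be sketched as follows. This is a minimal illustration under assumed details, not the authors' implementation: the function name `truncated_reverse_kl`, the renormalization over the truncated support, and the way special tokens are dropped are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    z = x - x.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def truncated_reverse_kl(student_logits, teacher_logits, k=8, special_ids=()):
    """Reverse KL(student || teacher) restricted to the teacher's top-K support.

    Special-token ids are removed from the support before renormalizing,
    mirroring the special-token masking described in the paper's fix.
    """
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    # teacher's top-K token ids, with special tokens masked out
    top = np.argsort(p_t)[::-1][:k]
    top = np.array([i for i in top if i not in set(special_ids)])
    q = p_t[top] / p_t[top].sum()  # teacher distribution renormalized on the support
    p = p_s[top] / p_s[top].sum()  # student distribution renormalized on the support
    return float(np.sum(p * np.log(p / q)))
```

In a training loop this loss would be evaluated per position of a student rollout (itself drawn with top-p sampling, per the paper) and averaged; here the single-position form keeps the sketch self-contained.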

Abstract

On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.
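The bias-variance tradeoff mentioned in the abstract can be illustrated with a toy Monte Carlo experiment. The sketch below is an assumption-laden simplification (prefix-independent next-token distributions, a 3-token vocabulary); it shows only the variance side of the tradeoff: a sequence-level reverse-KL estimator built from sampled log-prob ratios has the right mean but nonzero variance, whereas a per-step KL computed from full distributions is deterministic in this toy setting.

```python
import numpy as np

rng = np.random.default_rng(0)
ps = np.array([0.6, 0.3, 0.1])    # toy student next-token distribution
pt = np.array([0.5, 0.25, 0.25])  # toy teacher next-token distribution
T, N = 2, 20000                   # sequence length, Monte Carlo samples

# Exact per-step reverse KL; with prefix-independent distributions the
# sequence-level reverse KL is just T times this (zero-variance estimate).
kl_step = float(np.sum(ps * np.log(ps / pt)))

# Sequence-level estimator: sample y ~ student, score log ps(y) - log pt(y).
ys = rng.choice(3, size=(N, T), p=ps)
seq_est = (np.log(ps)[ys] - np.log(pt)[ys]).sum(axis=1)

print(seq_est.mean(), T * kl_step)  # means agree (estimator is unbiased)
print(seq_est.var())                # but it carries sampling variance
```

The bias of the sampled-token variant arises from prefix-dependent, one-token signals, which this prefix-independent toy deliberately omits; the paper's toy study covers both effects.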