TIP: Token Importance in On-Policy Distillation

arXiv cs.LG / 4/16/2026



Key Points

The paper studies on-policy knowledge distillation (OPD) and identifies which token positions provide the most useful learning signal to the student during its own rollouts.
It proposes TIP, a two-axis taxonomy where informative tokens come either from high-student-entropy positions or from low-student-entropy positions that have high teacher–student divergence (overconfident but wrong).
Experiments show that retaining 50% of tokens via entropy-based sampling matches or exceeds all-token training while cutting peak memory by up to 47%.
A second selection rule that isolates low-entropy, high-divergence tokens lets training on fewer than 10% of all tokens nearly match full-token baselines, indicating that overconfident errors carry dense corrective signal.
The authors validate TIP across three teacher–student pairs (spanning Qwen3, Llama, and Qwen2.5) on MATH-500, AIME 2024/2025, and the DeepPlanning benchmark for long-horizon agentic planning. The experiments are implemented by extending an OPD repository that supports memory-efficient distillation of larger models under limited GPU budgets.
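To make the entropy-based rule in the key points concrete, here is a minimal sketch that computes per-position student entropy from logits and keeps the highest-entropy fraction of a rollout. It is not the authors' implementation. The function names, the `keep_fraction` parameter, and the deterministic top-k selection (used here in place of whatever sampling scheme the paper applies) are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_entropy_mask(student_logits, keep_fraction=0.5):
    """Return a 0/1 mask that keeps the highest-entropy positions.

    student_logits: list of per-position logit lists for one rollout.
    keep_fraction: hypothetical knob; 0.5 mirrors the 50% setting above.
    """
    ents = [entropy(softmax(row)) for row in student_logits]
    k = max(1, math.ceil(keep_fraction * len(ents)))
    # Rank positions by entropy, highest first; ties go to the earlier position.
    ranked = sorted(range(len(ents)), key=lambda i: (-ents[i], i))
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(ents))]

# Toy rollout: 4 positions over a 3-token vocabulary (made-up numbers).
logits = [
    [5.0, 0.0, 0.0],  # confident
    [0.0, 0.0, 0.0],  # uniform, maximal entropy
    [1.0, 0.5, 0.0],  # fairly uncertain
    [8.0, 0.0, 0.0],  # very confident
]
mask = top_entropy_mask(logits, keep_fraction=0.5)
print(mask)
```

In a training loop, a mask like this would multiply the per-token distillation loss so that only retained positions contribute gradients. Whether that also yields memory savings depends on how the teacher and student logits are materialized, which this sketch does not model.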

Abstract

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher--student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining $50\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher--student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher--student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on $<20\%$ of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
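The two-axis view in the abstract can be sketched as a small classifier that places each token position by student entropy and teacher-student divergence. This is only an illustration under stated assumptions: the thresholds `h_thresh` and `d_thresh`, the choice of KL(student || teacher) as the divergence, and all function names are hypothetical and are not taken from the paper or its repository.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against division by zero in q."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def classify_token(student_p, teacher_p, h_thresh=0.5, d_thresh=0.5):
    """Place one position on the entropy/divergence axes.

    Thresholds are hypothetical; the divergence direction
    KL(student || teacher) is an assumption of this sketch.
    """
    high_entropy = entropy(student_p) >= h_thresh
    high_divergence = kl_divergence(student_p, teacher_p) >= d_thresh
    # Per the abstract, informative positions are either high-entropy,
    # or low-entropy with high divergence (overconfident and wrong).
    informative = high_entropy or high_divergence
    return {
        "high_entropy": high_entropy,
        "high_divergence": high_divergence,
        "informative": informative,
    }

# Student is confident in token 0 while the teacher prefers token 1.
overconfident = classify_token([0.98, 0.01, 0.01], [0.02, 0.97, 0.01])
# Student is uncertain and agrees with the teacher.
uncertain = classify_token([0.4, 0.3, 0.3], [0.4, 0.3, 0.3])
# Student is confident and agrees with the teacher.
settled = classify_token([0.98, 0.01, 0.01], [0.98, 0.01, 0.01])
```

The first example is the case an entropy-only rule would skip: the student's entropy is low, so the position looks uninformative until divergence from the teacher is taken into account.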