AI Navigate

PACED: Distillation at the Frontier of Student Competence

arXiv cs.AI / 3/13/2026


Key Points

  • The paper shows that gradient signal-to-noise ratio in distillation vanishes at both pass-rate extremes, causing wasted compute, and introduces Paced to concentrate distillation on the frontier of a student model's competence with a pass-rate weight w(p)=p^α(1−p)^β.
  • It proves that the Beta kernel is a leading-order weight family arising from the SNR structure of distillation and is minimax-robust, with worst-case efficiency loss of only O(δ^2) under bounded multiplicative misspecification.
  • Empirically, distilling from a larger teacher to a smaller student with forward KL yields gains over the base model while keeping benchmark forgetting low, and self-distillation with reverse KL yields further improvements.
  • A two-stage forward-KL-then-reverse-KL schedule provides the strongest improvements on standard reasoning benchmarks, supporting a mode-coverage-then-consolidation view; the approach requires only student rollouts, needs no architectural changes, and works with either KL direction.
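The Beta-kernel weight itself is simple to compute once pass rates are estimated from student rollouts. A minimal sketch (the function names and the default α, β values are illustrative, not from the paper):

```python
def pass_rate_weight(p, alpha=1.0, beta=1.0):
    """Beta-kernel weight w(p) = p^alpha * (1 - p)^beta.

    Vanishes at both pass-rate extremes (p = 0 and p = 1), mirroring
    the vanishing gradient SNR, and peaks at p = alpha / (alpha + beta).
    """
    return (p ** alpha) * ((1.0 - p) ** beta)


def estimate_pass_rate(successes, rollouts):
    """Estimate a problem's pass rate p from k successful student
    rollouts out of n total rollouts."""
    return successes / rollouts


# Example: with alpha = beta = 1 the weight peaks at p = 0.5,
# concentrating distillation on problems at the student's frontier.
p = estimate_pass_rate(4, 8)   # p = 0.5
w = pass_rate_weight(p)        # w = 0.25
```

In practice the per-problem weight would multiply that problem's distillation loss, so mastered problems (p ≈ 1) and out-of-reach problems (p ≈ 0) contribute almost nothing to the gradient.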

Abstract

Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely intuitive but structurally inevitable: the gradient signal-to-noise ratio in distillation provably vanishes at both pass-rate extremes. This theoretical observation leads to Paced, a framework that concentrates distillation on the zone of proximal development -- the frontier of a student model's competence -- via a principled pass-rate weight w(p) = p^\alpha(1 - p)^\beta derived from the boundary-vanishing structure of distillation gradients. Key results: (1) Theory: We prove that the Beta kernel w(p) = p^\alpha(1 - p)^\beta is a leading-order weight family arising from the SNR structure of distillation, and that it is minimax-robust -- under bounded multiplicative misspecification, the worst-case efficiency loss is only O(\delta^2). (2) Distillation: When distilling from a larger teacher to a smaller student model with forward KL, Paced achieves significant gains over the base model while keeping benchmark forgetting low. (3) Self-distillation: On instruction-tuned models with reverse KL, gains likewise exceed the baselines. (4) Two-stage synergy: A forward-KL-then-reverse-KL schedule yields the strongest results in our setting, reaching substantial improvements on standard reasoning benchmarks -- supporting a mode-coverage-then-consolidation interpretation of the distillation process. All configurations require only student rollouts to estimate pass rates, need no architectural changes, and are compatible with either KL direction.
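The two KL directions behind the two-stage schedule can be made concrete on discrete distributions. Forward KL(teacher || student) penalizes the student wherever the teacher puts mass (mode coverage), while reverse KL(student || teacher) penalizes the student for its own mass (consolidation). A minimal sketch, with hypothetical function names and distributions given as probability lists:

```python
import math


def kl_div(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_loss(teacher_probs, student_probs, stage):
    """Two-stage schedule: forward KL(teacher || student) first for
    mode coverage, then reverse KL(student || teacher) to consolidate."""
    if stage == "forward":
        return kl_div(teacher_probs, student_probs)
    return kl_div(student_probs, teacher_probs)


teacher = [0.9, 0.1]
student = [0.5, 0.5]
fwd = distill_loss(teacher, student, "forward")  # penalizes missing teacher mass
rev = distill_loss(teacher, student, "reverse")  # penalizes spurious student mass
```

Because KL divergence is asymmetric, the two stages pull the student in genuinely different ways, which is the intuition behind the mode-coverage-then-consolidation reading of the results.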