AI Navigate

PACED: Distillation at the Frontier of Student Competence

arXiv cs.AI / 3/13/2026


Key Points

  • The paper shows that gradient signal-to-noise ratio in distillation vanishes at both pass-rate extremes, causing wasted compute, and introduces Paced to concentrate distillation on the frontier of a student model's competence with a pass-rate weight w(p)=p^α(1−p)^β.
  • It proves that the Beta kernel is a leading-order weight family arising from the SNR structure of distillation and is minimax-robust, with worst-case efficiency loss of only O(δ^2) under bounded multiplicative misspecification.
  • Empirically, distilling from a larger teacher to a smaller student with forward KL yields gains over the base model while keeping benchmark forgetting low, and self-distillation with reverse KL yields further improvements.
  • A two-stage forward-KL-then-reverse-KL schedule provides the strongest improvements on standard reasoning benchmarks, supporting a mode-coverage-then-consolidation view; the approach requires only student rollouts, needs no architectural changes, and works with either KL direction.
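The Beta-kernel weight itself is simple to compute once pass rates are estimated from student rollouts. A minimal sketch (the function names and the default α, β values are illustrative, not from the paper):

```python
def pass_rate_weight(p, alpha=1.0, beta=1.0):
    """Beta-kernel weight w(p) = p^alpha * (1 - p)^beta.

    Vanishes at both pass-rate extremes (p = 0 and p = 1), mirroring
    the vanishing gradient SNR, and peaks at p = alpha / (alpha + beta).
    """
    return (p ** alpha) * ((1.0 - p) ** beta)


def estimate_pass_rate(successes, rollouts):
    """Estimate a problem's pass rate p from k successful student
    rollouts out of n total rollouts."""
    return successes / rollouts


# Example: with alpha = beta = 1 the weight peaks at p = 0.5,
# concentrating distillation on problems at the student's frontier.
p = estimate_pass_rate(4, 8)   # p = 0.5
w = pass_rate_weight(p)        # w = 0.25
```

In practice the per-problem weight would multiply that problem's distillation loss, so mastered problems (p ≈ 1) and out-of-reach problems (p ≈ 0) contribute almost nothing to the gradient.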

Abstract

Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely intuitive but structurally inevitable: the gradient signal-to-noise ratio in distillation provably vanishes at both pass-rate extremes. This theoretical observation leads to Paced, a framework that concentrates distillation on the zone of proximal development -- the frontier of a student model's competence -- via a principled pass-rate weight w(p) = p^\alpha(1 - p)^\beta derived from the boundary-vanishing structure of distillation gradients. Key results: (1) Theory: We prove that the Beta kernel w(p) = p^\alpha(1 - p)^\beta is a leading-order weight family arising from the SNR structure of distillation, and that it is minimax-robust -- under bounded multiplicative misspecification, the worst-case efficiency loss is only O(\delta^2). (2) Distillation: When distilling from a larger teacher to a smaller student model with forward KL, Paced achieves significant gains over the base model while keeping benchmark forgetting low. (3) Self-distillation: On instruction-tuned models with reverse KL, gains likewise exceed the baselines. (4) Two-stage synergy: A forward-KL-then-reverse-KL schedule yields the strongest results in our setting, reaching substantial improvements on standard reasoning benchmarks -- supporting a mode-coverage-then-consolidation interpretation of the distillation process. All configurations require only student rollouts to estimate pass rates, need no architectural changes, and are compatible with either KL direction.
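The two KL directions behind the two-stage schedule can be made concrete on discrete distributions. Forward KL(teacher || student) penalizes the student wherever the teacher puts mass (mode coverage), while reverse KL(student || teacher) penalizes the student for its own mass (consolidation). A minimal sketch, with hypothetical function names and distributions given as probability lists:

```python
import math


def kl_div(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_loss(teacher_probs, student_probs, stage):
    """Two-stage schedule: forward KL(teacher || student) first for
    mode coverage, then reverse KL(student || teacher) to consolidate."""
    if stage == "forward":
        return kl_div(teacher_probs, student_probs)
    return kl_div(student_probs, teacher_probs)


teacher = [0.9, 0.1]
student = [0.5, 0.5]
fwd = distill_loss(teacher, student, "forward")  # penalizes missing teacher mass
rev = distill_loss(teacher, student, "reverse")  # penalizes spurious student mass
```

Because KL divergence is asymmetric, the two stages pull the student in genuinely different ways, which is the intuition behind the mode-coverage-then-consolidation reading of the results.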