Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation

arXiv cs.AI / 4/2/2026


Key Points

  • The paper analyzes why Reverse Kullback-Leibler (RKL) divergence is effective for LLM distillation, showing it typically outperforms forward KL (FKL) under large vocabularies and teacher-student capacity mismatch by concentrating learning on dominant modes.
  • It identifies a structural drawback of RKL: non-target gradients can increase target logits even when the student already matches the teacher, which reduces output diversity and can make the student overly confident.
  • The authors show RKL also provides weak supervision for non-target classes, leading to poor alignment for tail (less likely) categories.
  • To fix these issues, they introduce Diversity-aware RKL (DRKL), which removes the problematic gradient behavior and improves non-target supervision while retaining RKL’s optimization advantages.
  • Experiments across multiple datasets and model families indicate DRKL consistently beats FKL, RKL, and other distillation objectives, improving the fidelity–diversity trade-off.
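
The gradient behavior described in the second bullet can be checked on a toy example. The sketch below (not from the paper; the 5-class vocabulary and logit values are made up for illustration) decomposes the analytic RKL gradient on the target logit into per-class contributions, using the standard identity for the gradient of softmax-based RKL. Even when the student exactly matches the teacher, the non-target contributions are strictly negative, meaning gradient descent on them alone would push the target logit upward:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy 5-class vocabulary; student logits equal teacher logits,
# i.e. the student already matches the teacher exactly.
teacher_logits = np.array([2.0, 1.0, 0.5, 0.0, -1.0])
student_logits = teacher_logits.copy()

p = softmax(teacher_logits)  # teacher distribution
q = softmax(student_logits)  # student distribution
t = 0                        # index of the target (highest-probability) class

# For RKL(q || p) with q = softmax(z), the gradient w.r.t. logit z_t
# decomposes over classes i:
#   dL/dz_t = sum_i q_i * (delta_it - q_t) * (log q_i - log p_i + 1)
contrib = q * ((np.arange(len(q)) == t) - q[t]) * (np.log(q) - np.log(p) + 1)

target_contrib = contrib[t]                     # from the target class itself
nontarget_contrib = contrib.sum() - contrib[t]  # from all non-target classes

# With q == p the total gradient is zero, but the non-target part is
# strictly negative: descending on it alone would increase z_t.
print(contrib.sum())       # ~0 (up to float rounding)
print(nontarget_contrib)   # < 0
```

The cancellation between the positive target term and the negative non-target terms is exact only when the student matches the teacher; the bullet's point is that the non-target component keeps pressing the target logit upward regardless of fit.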

Abstract

Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation, consistently outperforming forward KL (FKL), particularly in regimes with large vocabularies and significant teacher-student capacity mismatch, where RKL focuses learning on dominant modes rather than enforcing dense alignment. However, RKL introduces a structural limitation that drives the student toward overconfident predictions. We first provide an analysis of RKL by decomposing its gradients into target and non-target components, and show that non-target gradients consistently push the target logit upward even when the student already matches the teacher, thereby reducing output diversity. In addition, RKL provides weak supervision over non-target classes, leading to poor tail alignment. To address these issues, we propose Diversity-aware RKL (DRKL), which removes this gradient effect and strengthens non-target supervision while preserving the optimization benefits of RKL. Extensive experiments across datasets and model families demonstrate that DRKL consistently outperforms FKL, RKL, and other state-of-the-art distillation objectives, achieving better performance and a superior fidelity-diversity trade-off.
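
The abstract's contrast between RKL's mode-seeking behavior and FKL's dense alignment follows from the well-known zero-forcing/zero-avoiding asymmetry of the two divergences. The sketch below (an illustration, not from the paper; the 4-token distribution and student values are made up) shows that a student which nearly drops the teacher's tail mass incurs a large FKL penalty but only a small RKL penalty:

```python
import numpy as np

def kl(a, b):
    """KL(a || b) = sum_i a_i * log(a_i / b_i)."""
    return float(np.sum(a * np.log(a / b)))

# Toy 4-token teacher distribution: two dominant modes plus a small tail.
p = np.array([0.45, 0.45, 0.05, 0.05])

# A student that concentrates on the dominant modes and nearly drops the tail.
q = np.array([0.5599, 0.4399, 1e-4, 1e-4])
q = q / q.sum()

fkl = kl(p, q)  # forward KL(p || q): heavily penalizes missing tail mass
rkl = kl(q, p)  # reverse KL(q || p): barely penalizes dropped tail mass

print(fkl, rkl)  # FKL is several times larger than RKL here
```

As the student's tail mass shrinks further, FKL grows without bound while RKL stays small: this is why RKL tolerates mode-focused students under capacity mismatch, and also why, per the paper, it supervises non-target (tail) classes only weakly.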