Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation

arXiv cs.AI / 4/2/2026


Key Points

  • The paper analyzes why Reverse Kullback-Leibler (RKL) divergence is effective for LLM distillation, showing it typically outperforms forward KL (FKL) under large vocabularies and teacher-student capacity mismatch by concentrating learning on dominant modes.
  • It identifies a structural drawback of RKL: non-target gradients can increase target logits even when the student already matches the teacher, which reduces output diversity and can make the student overly confident.
  • The authors show RKL also provides weak supervision for non-target classes, leading to poor alignment for tail (less likely) categories.
  • To fix these issues, they introduce Diversity-aware RKL (DRKL), which removes the problematic gradient behavior and improves non-target supervision while retaining RKL’s optimization advantages.
  • Experiments across multiple datasets and model families indicate DRKL consistently beats FKL, RKL, and other distillation objectives, improving the fidelity–diversity trade-off.
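
The gradient behavior described in the second bullet can be checked on a toy example. The sketch below (not from the paper; the 5-class vocabulary and logit values are made up for illustration) decomposes the analytic RKL gradient on the target logit into per-class contributions, using the standard identity for the gradient of softmax-based RKL. Even when the student exactly matches the teacher, the non-target contributions are strictly negative, meaning gradient descent on them alone would push the target logit upward:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy 5-class vocabulary; student logits equal teacher logits,
# i.e. the student already matches the teacher exactly.
teacher_logits = np.array([2.0, 1.0, 0.5, 0.0, -1.0])
student_logits = teacher_logits.copy()

p = softmax(teacher_logits)  # teacher distribution
q = softmax(student_logits)  # student distribution
t = 0                        # index of the target (highest-probability) class

# For RKL(q || p) with q = softmax(z), the gradient w.r.t. logit z_t
# decomposes over classes i:
#   dL/dz_t = sum_i q_i * (delta_it - q_t) * (log q_i - log p_i + 1)
contrib = q * ((np.arange(len(q)) == t) - q[t]) * (np.log(q) - np.log(p) + 1)

target_contrib = contrib[t]                     # from the target class itself
nontarget_contrib = contrib.sum() - contrib[t]  # from all non-target classes

# With q == p the total gradient is zero, but the non-target part is
# strictly negative: descending on it alone would increase z_t.
print(contrib.sum())       # ~0 (up to float rounding)
print(nontarget_contrib)   # < 0
```

The cancellation between the positive target term and the negative non-target terms is exact only when the student matches the teacher; the bullet's point is that the non-target component keeps pressing the target logit upward regardless of fit.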

Abstract

Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation, consistently outperforming forward KL (FKL), particularly in regimes with large vocabularies and significant teacher-student capacity mismatch, where RKL focuses learning on dominant modes rather than enforcing dense alignment. However, RKL introduces a structural limitation that drives the student toward overconfident predictions. We first provide an analysis of RKL by decomposing its gradients into target and non-target components, and show that non-target gradients consistently push the target logit upward even when the student already matches the teacher, thereby reducing output diversity. In addition, RKL provides weak supervision over non-target classes, leading to poor tail alignment. To address these issues, we propose Diversity-aware RKL (DRKL), which removes this gradient effect and strengthens non-target supervision while preserving the optimization benefits of RKL. Extensive experiments across datasets and model families demonstrate that DRKL consistently outperforms FKL, RKL, and other state-of-the-art distillation objectives, achieving better performance and a superior fidelity-diversity trade-off.
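
The abstract's contrast between RKL's mode-seeking behavior and FKL's dense alignment follows from the well-known zero-forcing/zero-avoiding asymmetry of the two divergences. The sketch below (an illustration, not from the paper; the 4-token distribution and student values are made up) shows that a student which nearly drops the teacher's tail mass incurs a large FKL penalty but only a small RKL penalty:

```python
import numpy as np

def kl(a, b):
    """KL(a || b) = sum_i a_i * log(a_i / b_i)."""
    return float(np.sum(a * np.log(a / b)))

# Toy 4-token teacher distribution: two dominant modes plus a small tail.
p = np.array([0.45, 0.45, 0.05, 0.05])

# A student that concentrates on the dominant modes and nearly drops the tail.
q = np.array([0.5599, 0.4399, 1e-4, 1e-4])
q = q / q.sum()

fkl = kl(p, q)  # forward KL(p || q): heavily penalizes missing tail mass
rkl = kl(q, p)  # reverse KL(q || p): barely penalizes dropped tail mass

print(fkl, rkl)  # FKL is several times larger than RKL here
```

As the student's tail mass shrinks further, FKL grows without bound while RKL stays small: this is why RKL tolerates mode-focused students under capacity mismatch, and also why, per the paper, it supervises non-target (tail) classes only weakly.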