Hybrid Policy Distillation for LLMs

arXiv cs.CL · April 23, 2026


Key Points

  • The paper analyzes how existing knowledge distillation (KD) methods for LLMs differ in divergence direction, optimization strategy, and data regime, and reframes KD as a token-level reweighted log-likelihood objective.
  • It introduces Hybrid Policy Distillation (HPD), which blends forward and reverse KL divergences to balance mode coverage against mode seeking.
  • HPD also combines off-policy data with lightweight, approximate on-policy sampling to reduce the cost of full on-policy training.
  • Experiments on long-form math reasoning and short-form dialogue/code tasks show HPD improves optimization stability, computational efficiency, and final performance across multiple model families and scales.
  • The authors provide accompanying code on GitHub to enable reproduction and further experimentation.
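To make the hybrid-divergence idea above concrete, here is a minimal sketch of blending forward KL (mass-covering) and reverse KL (mode-seeking) over a single token distribution. The mixing weight `alpha` and the function names are illustrative assumptions, not taken from the paper, which defines its objective at the token level over full sequences.

```python
import math

def kl(p, q):
    # KL(p || q) over a discrete next-token distribution;
    # terms with p_i == 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hybrid_kl(teacher, student, alpha=0.5):
    # Blend forward KL (teacher || student, coverage-promoting) with
    # reverse KL (student || teacher, mode-seeking).
    # alpha is a hypothetical mixing weight for illustration only.
    forward = kl(teacher, student)
    reverse = kl(student, teacher)
    return alpha * forward + (1 - alpha) * reverse

# Toy 3-token vocabulary: the blended loss is zero only when the
# student matches the teacher exactly.
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
loss = hybrid_kl(teacher, student)
```

Setting `alpha` to 1 or 0 recovers pure forward or pure reverse KL distillation, so a scheme like this interpolates between the coverage and mode-seeking regimes the paper contrasts.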

Abstract

Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.
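The abstract's unified view, KD as a reweighted log-likelihood objective at the token level, can be sketched as follows. The helper below is a hypothetical illustration: different divergence directions and data regimes would correspond to different per-token weights `w_t`, a detail the paper formalizes but which is only schematized here.

```python
import math

def reweighted_nll(student_probs, weights):
    # KD viewed as a weighted negative log-likelihood over tokens:
    # student_probs[t] is the student's probability of the target
    # token at step t, and weights[t] is a per-token weight induced
    # by the chosen divergence direction and data regime.
    return -sum(w * math.log(p) for w, p in zip(weights, student_probs))

# With uniform unit weights this reduces to plain sequence-level
# cross-entropy, i.e. standard supervised fine-tuning on the targets.
probs = [0.9, 0.6, 0.8]
uniform = reweighted_nll(probs, [1.0, 1.0, 1.0])
```

Under this view, upweighting tokens where teacher and student disagree would steer the objective toward mode-seeking behavior, while uniform weights recover ordinary likelihood training.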