HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

arXiv cs.LG / 2026-03-26


Key Points

  • The paper introduces Hybrid Distillation Policy Optimization (HDPO) to address RL “cliff” prompts in mathematical reasoning where all rollouts fail and RL gradients vanish.
  • HDPO augments standard RL by detecting prompts with total rollout failure, generating privileged rollouts using ground-truth information, filtering to keep only correct solutions, and distilling the teacher’s token-level distribution into the student.
  • Because the teacher and student share the same underlying weights (differing only by privileged input), the method provides a bounded realizability gap compared with cross-model distillation.
  • The authors prove that with R=1 filtered privileged generation, HDPO recovers the optimal KL-regularized RL policy in a hard-threshold limit, giving theoretical justification for the approach.
  • Experiments on OpenMathInstruct-2 using Qwen2.5-Math-1.5B-Instruct show improved coverage (pass@4 up +0.8–1.1%, pass@8 up +0.4–1.7%) while preserving greedy accuracy, with the distillation weight lambda controlling the exploration–exploitation balance.
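The pipeline in the bullets above (detect total-failure prompts, generate privileged rollouts, keep only verified-correct ones) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_rollouts`, `verify`, and `privileged_generate` are hypothetical stand-ins for the rollout sampler, the answer checker, and generation conditioned on ground-truth information.

```python
def detect_cliff_prompts(prompts, rollouts_per_prompt, verify):
    """Return the prompts on which every sampled rollout fails verification.

    These are the "cliff" prompts: with zero correct rollouts, the RL
    advantage is identically zero and no gradient signal is produced.
    """
    cliff = []
    for prompt, rollouts in zip(prompts, rollouts_per_prompt):
        if not any(verify(prompt, r) for r in rollouts):
            cliff.append(prompt)
    return cliff


def filtered_privileged_rollouts(cliff_prompts, privileged_generate, verify, R=1):
    """For each cliff prompt, draw R rollouts with privileged (ground-truth)
    input and keep only those that verify as correct.

    The surviving rollouts become distillation targets for the student,
    which sees only the unprivileged prompt.
    """
    targets = {}
    for prompt in cliff_prompts:
        correct = [r for r in (privileged_generate(prompt) for _ in range(R))
                   if verify(prompt, r)]
        if correct:
            targets[prompt] = correct
    return targets
```

Note that teacher and student are the same model here; only the input differs, which is what bounds the realizability gap relative to distilling from a separate teacher model.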

Abstract

Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all - "cliff" prompts - the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher's token-level distribution into the student. Because teacher and student share the same weights - differing only in their input - the realizability gap is provably bounded, unlike cross-model distillation. We prove that R=1 filtered privileged generation recovers the optimal KL-regularized RL policy in the hard-threshold limit. Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show that HDPO consistently improves coverage metrics (pass@4 by +0.8-1.1%, pass@8 by +0.4-1.7%) while maintaining greedy accuracy, with the distillation weight lambda providing direct control over the exploration-exploitation tradeoff.
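The abstract's hybrid objective, RL loss plus a lambda-weighted token-level KL from the privileged teacher distribution to the student distribution, can be written out in a small self-contained sketch. The function names and the per-token logit representation are illustrative assumptions, not the paper's code; the point is only the shape of the combined loss and how lambda trades off the two terms.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) for one token position's vocabulary distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hybrid_loss(rl_loss, teacher_logit_seq, student_logit_seq, lam):
    """HDPO-style objective sketch: RL loss plus lambda times the mean
    token-level KL from the privileged teacher to the student.

    Larger lam pushes the student harder toward the teacher's (privileged,
    verified-correct) distribution on cliff prompts; lam = 0 recovers
    plain RL.
    """
    kl = sum(token_kl(t, s)
             for t, s in zip(teacher_logit_seq, student_logit_seq))
    kl /= len(teacher_logit_seq)
    return rl_loss + lam * kl
```

When teacher and student logits coincide the KL term vanishes and the objective reduces to the RL loss alone, consistent with lambda acting purely as an exploration-exploitation knob.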