HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation
arXiv cs.LG / 2026/3/26
Key Points
- The paper introduces Hybrid Distillation Policy Optimization (HDPO) to address RL “cliff” prompts in mathematical reasoning where all rollouts fail and RL gradients vanish.
- HDPO augments standard RL by detecting prompts with total rollout failure, generating privileged rollouts using ground-truth information, filtering to keep only correct solutions, and distilling the teacher’s token-level distribution into the student.
- Because the teacher and student share the same underlying weights (differing only by privileged input), the method provides a bounded realizability gap compared with cross-model distillation.
- The authors prove that with R=1 filtered privileged generation, HDPO recovers the optimal KL-regularized RL policy in a hard-threshold limit, giving theoretical justification for the approach.
- Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show improved coverage (pass@4 up 0.8–1.1 points, pass@8 up 0.4–1.7 points) while preserving greedy accuracy; the distillation weight lambda controls the exploration–exploitation trade-off.
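The control flow described above can be sketched as a toy training step. This is a minimal illustration, not the paper's implementation: the function and field names (`hdpo_step`, `privileged`, `teacher_dist`, etc.) are invented for exposition, rewards are assumed binary, and the RL branch uses a simple mean-baseline policy-gradient surrogate as a stand-in for whatever RL objective the paper pairs with distillation.

```python
import math

def is_cliff(rewards):
    """A prompt hits the RL 'cliff' when every rollout fails (all rewards are 0)."""
    return all(r == 0 for r in rewards)

def kl(p, q):
    """KL(p || q) between two token-level probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hdpo_step(prompts, lam=0.5):
    """One hybrid loss over a batch (hypothetical schema).

    Each prompt dict carries:
      rewards     - per-rollout binary rewards from standard sampling
      logps       - per-rollout sequence log-probs under the student
      privileged  - rollouts generated with ground-truth access, each with a
                    'correct' flag and per-token teacher/student distributions
    """
    total = 0.0
    for pr in prompts:
        if is_cliff(pr["rewards"]):
            # Cliff branch: keep only verified-correct privileged rollouts,
            # then distill the teacher's token distributions into the student.
            for r in (x for x in pr["privileged"] if x["correct"]):
                total += lam * sum(
                    kl(t, s) for t, s in zip(r["teacher_dist"], r["student_dist"])
                )
        else:
            # Standard branch: policy-gradient surrogate with a mean baseline.
            mean_r = sum(pr["rewards"]) / len(pr["rewards"])
            for r_i, logp in zip(pr["rewards"], pr["logps"]):
                total += -(r_i - mean_r) * logp
    return total
```

Note that because the teacher is the same network conditioned on privileged input, the distillation term targets distributions the student can actually represent, which is the intuition behind the bounded realizability gap mentioned above.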
