HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation
arXiv cs.LG · March 26, 2026
Key Points
- The paper introduces Hybrid Distillation Policy Optimization (HDPO) to address RL “cliff” prompts in mathematical reasoning where all rollouts fail and RL gradients vanish.
- HDPO augments standard RL by detecting prompts with total rollout failure, generating privileged rollouts using ground-truth information, filtering to keep only correct solutions, and distilling the teacher’s token-level distribution into the student.
- Because the teacher and student share the same underlying weights (differing only by privileged input), the method provides a bounded realizability gap compared with cross-model distillation.
- The authors prove that with R=1 filtered privileged generation, HDPO recovers the optimal KL-regularized RL policy in a hard-threshold limit, giving theoretical justification for the approach.
- Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show improved coverage (pass@4 +0.8–1.1%, pass@8 +0.4–1.7%) while preserving greedy accuracy; the distillation weight λ controls the exploration–exploitation balance.
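The per-step logic described in the key points — detect all-fail "cliff" prompts, then add a token-level distillation term from the privileged teacher — can be sketched in a toy numpy form. All names here (`detect_cliff_prompts`, `hdpo_loss`, `lam`) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def detect_cliff_prompts(rewards):
    """rewards: (num_prompts, num_rollouts) array of binary correctness.
    A 'cliff' prompt is one where every rollout failed, so the
    standard RL gradient carries no learning signal."""
    return rewards.sum(axis=1) == 0

def token_level_kl(teacher_logits, student_logits):
    """Mean forward KL(teacher || student) across token positions.
    Teacher and student share the same weights; the teacher simply
    conditions on privileged (ground-truth) input, which is what
    bounds the realizability gap."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean())

def hdpo_loss(rl_loss, rewards, teacher_logits, student_logits, lam):
    """Combine the usual RL loss with a distillation term that is
    active only when cliff prompts exist; lam trades off
    exploration (RL) against imitation (distillation)."""
    cliff = detect_cliff_prompts(rewards)
    distill = token_level_kl(teacher_logits, student_logits) if cliff.any() else 0.0
    return rl_loss + lam * distill, cliff
```

In a real run the privileged rollouts would first be filtered for correctness before the teacher distribution is distilled; this sketch only illustrates the routing and loss combination.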