HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

arXiv cs.LG / 2026-03-26


Key Points

  • The paper introduces Hybrid Distillation Policy Optimization (HDPO) to address RL “cliff” prompts in mathematical reasoning where all rollouts fail and RL gradients vanish.
  • HDPO augments standard RL by detecting prompts with total rollout failure, generating privileged rollouts using ground-truth information, filtering to keep only correct solutions, and distilling the teacher’s token-level distribution into the student.
  • Because the teacher and student share the same underlying weights (differing only by privileged input), the method provides a bounded realizability gap compared with cross-model distillation.
  • The authors prove that with R=1 filtered privileged generation, HDPO recovers the optimal KL-regularized RL policy in a hard-threshold limit, giving theoretical justification for the approach.
  • Experiments on OpenMathInstruct-2 using Qwen2.5-Math-1.5B-Instruct show improved coverage (pass@4 up +0.8–1.1%, pass@8 up +0.4–1.7%) while preserving greedy accuracy, with the distillation weight lambda controlling the exploration–exploitation balance.
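The pipeline in the bullets above (detect total-failure prompts, generate privileged rollouts, keep only verified-correct ones) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_rollouts`, `verify`, and `privileged_generate` are hypothetical stand-ins for the rollout sampler, the answer checker, and generation conditioned on ground-truth information.

```python
def detect_cliff_prompts(prompts, rollouts_per_prompt, verify):
    """Return the prompts on which every sampled rollout fails verification.

    These are the "cliff" prompts: with zero correct rollouts, the RL
    advantage is identically zero and no gradient signal is produced.
    """
    cliff = []
    for prompt, rollouts in zip(prompts, rollouts_per_prompt):
        if not any(verify(prompt, r) for r in rollouts):
            cliff.append(prompt)
    return cliff


def filtered_privileged_rollouts(cliff_prompts, privileged_generate, verify, R=1):
    """For each cliff prompt, draw R rollouts with privileged (ground-truth)
    input and keep only those that verify as correct.

    The surviving rollouts become distillation targets for the student,
    which sees only the unprivileged prompt.
    """
    targets = {}
    for prompt in cliff_prompts:
        correct = [r for r in (privileged_generate(prompt) for _ in range(R))
                   if verify(prompt, r)]
        if correct:
            targets[prompt] = correct
    return targets
```

Note that teacher and student are the same model here; only the input differs, which is what bounds the realizability gap relative to distilling from a separate teacher model.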

Abstract

Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all - "cliff" prompts - the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher's token-level distribution into the student. Because teacher and student share the same weights - differing only in their input - the realizability gap is provably bounded, unlike cross-model distillation. We prove that R=1 filtered privileged generation recovers the optimal KL-regularized RL policy in the hard-threshold limit. Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show that HDPO consistently improves coverage metrics (pass@4 by +0.8-1.1%, pass@8 by +0.4-1.7%) while maintaining greedy accuracy, with the distillation weight lambda providing direct control over the exploration-exploitation tradeoff.
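The abstract's hybrid objective, RL loss plus a lambda-weighted token-level KL from the privileged teacher distribution to the student distribution, can be written out in a small self-contained sketch. The function names and the per-token logit representation are illustrative assumptions, not the paper's code; the point is only the shape of the combined loss and how lambda trades off the two terms.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) for one token position's vocabulary distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hybrid_loss(rl_loss, teacher_logit_seq, student_logit_seq, lam):
    """HDPO-style objective sketch: RL loss plus lambda times the mean
    token-level KL from the privileged teacher to the student.

    Larger lam pushes the student harder toward the teacher's (privileged,
    verified-correct) distribution on cliff prompts; lam = 0 recovers
    plain RL.
    """
    kl = sum(token_kl(t, s)
             for t, s in zip(teacher_logit_seq, student_logit_seq))
    kl /= len(teacher_logit_seq)
    return rl_loss + lam * kl
```

When teacher and student logits coincide the KL term vanishes and the objective reduces to the RL loss alone, consistent with lambda acting purely as an exploration-exploitation knob.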