Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
arXiv cs.LG / 5/6/2026
Key Points
- The paper analyzes why on-policy distillation (OPD) sometimes fails, pinpointing two key bottlenecks: lack of exploration of informative states and unreliable teacher supervision during student rollouts.
- It introduces Uni-OPD, a unified framework that works across both LLMs and multimodal LLMs by using a dual-perspective optimization strategy.
- From the student side, Uni-OPD applies two data-balancing methods that encourage exploration of informative, student-generated states during training.
- From the teacher side, it proposes outcome-guided margin calibration to restore “order consistency” between token-level guidance and the final outcome reward, improving supervision quality.
- Experiments across 5 domains and 16 benchmarks (including single/multi-teacher, strong-to-weak, and cross-modal distillation) show Uni-OPD is both effective and broadly applicable, offering practical guidance for reliable OPD.
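The summary doesn't give the paper's exact objective, but on-policy distillation typically trains the student on its own rollouts against the teacher's token-level distributions, often via a per-token reverse KL. A minimal NumPy sketch of that baseline loss (shapes, names, and the toy random logits are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits, axis=-1):
    # numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def opd_reverse_kl(student_logits, teacher_logits):
    """Per-token reverse KL(student || teacher), computed on
    tokens the student itself generated (the on-policy rollout)."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return np.sum(p_s * (np.log(p_s) - np.log(p_t)), axis=-1)

# toy rollout: T=4 generated tokens, V=8 vocabulary entries
T, V = 4, 8
student_logits = rng.normal(size=(T, V))  # hypothetical student scores
teacher_logits = rng.normal(size=(T, V))  # hypothetical teacher scores

per_token_kl = opd_reverse_kl(student_logits, teacher_logits)
loss = per_token_kl.mean()  # scalar distillation loss for this rollout
```

The paper's contributions then modify this recipe: reweighting which rollout states enter the loss (the student-side balancing) and calibrating the teacher's token-level margins against the outcome reward (the teacher-side fix).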