Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

arXiv cs.LG / 5/6/2026

📰 News · Models & Research

Key Points

  • The paper analyzes why on-policy distillation (OPD) sometimes fails, pinpointing two key bottlenecks: lack of exploration of informative states and unreliable teacher supervision during student rollouts.
  • It introduces Uni-OPD, a unified framework that works across both LLMs and multimodal LLMs by using a dual-perspective optimization strategy.
  • On the student side, Uni-OPD applies two data-balancing strategies that encourage exploration of informative, student-generated states during training.
  • On the teacher side, it proposes outcome-guided margin calibration to restore “order consistency” between aggregated token-level guidance and the final outcome reward, improving supervision quality (both sides are sketched in code after this list).
  • Experiments across 5 domains and 16 benchmarks (including single/multi-teacher, strong-to-weak, and cross-modal distillation) show Uni-OPD is both effective and broadly applicable, offering practical guidance for reliable OPD.
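The summary names the two mechanisms but not their formulations, so here is a minimal, illustrative sketch of what the dual-perspective recipe could look like. Everything below is an assumption rather than the paper's implementation: the helper names `balance_rollouts` and `margin_calibration_loss` are hypothetical, outcome rewards are assumed binary, and order consistency is enforced with a hinge-style pairwise margin.

```python
# Illustrative sketch only: these helpers and names are assumptions, not the
# paper's implementation. Rollouts are dicts with a binary "reward" field.
import random

import torch


def balance_rollouts(rollouts, target_correct_ratio=0.5):
    """Student side (sketch): rebalance a batch of student rollouts so correct
    and incorrect outcomes are roughly evenly represented, keeping the
    under-explored (more informative) side in the training batch."""
    correct = [r for r in rollouts if r["reward"] > 0]
    incorrect = [r for r in rollouts if r["reward"] <= 0]
    n = min(len(correct), len(incorrect))
    if n == 0:  # degenerate batch: nothing to balance against
        return rollouts
    k_correct = max(1, round(2 * n * target_correct_ratio))
    k_incorrect = max(1, 2 * n - k_correct)
    return (random.sample(correct, min(k_correct, len(correct)))
            + random.sample(incorrect, min(k_incorrect, len(incorrect))))


def margin_calibration_loss(token_scores, rewards, mask, margin=0.1):
    """Teacher side (sketch): aggregate per-token teacher guidance into a
    trajectory score, then penalize every (correct, incorrect) pair whose
    aggregated scores violate the outcome ordering by less than `margin`."""
    # token_scores: [B, T] teacher guidance per student token (e.g. log-probs)
    # rewards:      [B]    binary outcome reward per trajectory (tensor)
    # mask:         [B, T] 1.0 for real tokens, 0.0 for padding
    traj_scores = (token_scores * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    pos, neg = traj_scores[rewards > 0], traj_scores[rewards <= 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return token_scores.new_zeros(())
    # Pairwise hinge: correct trajectories should receive aggregated guidance
    # at least `margin` above every incorrect trajectory in the batch.
    violations = margin - (pos.unsqueeze(1) - neg.unsqueeze(0))
    return violations.clamp(min=0).mean()
```

A pairwise hinge is only one way to restore order consistency; whatever calibration Uni-OPD actually uses, the invariant the summary describes is the same: aggregated token-level guidance for correct trajectories should rank above that for incorrect ones.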

Abstract

On-policy distillation (OPD) has recently emerged as an effective post-training paradigm for consolidating the capabilities of specialized expert models into a single student model. Despite its empirical success, the conditions under which OPD yields reliable improvement remain poorly understood. In this work, we identify two fundamental bottlenecks that limit effective OPD: insufficient exploration of informative states and unreliable teacher supervision for student rollouts. Building on this insight, we propose Uni-OPD, a unified OPD framework that generalizes across Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), centered on a dual-perspective optimization strategy. Specifically, from the student's perspective, we adopt two data-balancing strategies to promote exploration of informative student-generated states during training. From the teacher's perspective, we show that reliable supervision hinges on whether aggregated token-level guidance remains order-consistent with the outcome reward. To this end, we develop an outcome-guided margin calibration mechanism to restore order consistency between correct and incorrect trajectories. We conduct extensive experiments across 5 domains and 16 benchmarks covering diverse settings, including single-teacher and multi-teacher distillation across LLMs and MLLMs, strong-to-weak distillation, and cross-modal distillation. Our results verify the effectiveness and versatility of Uni-OPD and provide practical insights into reliable OPD.
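To ground the abstract's terminology, the sketch below shows the base OPD update that Uni-OPD builds on: the student generates the rollout, and the teacher supervises the student's own tokens with a per-token reverse-KL objective. This is the standard on-policy distillation loss from the literature, not code from the paper; the `opd_step` helper and the tensor shapes are illustrative.

```python
# Minimal on-policy distillation step (generic, not Uni-OPD specific).
# The student has already sampled a rollout; we receive both models' logits
# over the student-generated tokens and minimize per-token reverse KL.
import torch
import torch.nn.functional as F


def opd_step(student_logits, teacher_logits, mask):
    """student_logits, teacher_logits: [B, T, V]; mask: [B, T] (1 = real token).
    Returns mean reverse KL(student || teacher) over unmasked tokens, the usual
    token-level OPD objective computed on the student's own rollout."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits.detach(), dim=-1)  # teacher frozen
    # Reverse KL takes the expectation under the student's distribution, so
    # the student is corrected exactly on the states it actually visits.
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(-1)  # [B, T]
    return (kl * mask).sum() / mask.sum().clamp(min=1)


# Toy usage with random logits: batch of 2, 5 tokens, vocabulary of 8.
if __name__ == "__main__":
    B, T, V = 2, 5, 8
    loss = opd_step(torch.randn(B, T, V), torch.randn(B, T, V),
                    torch.ones(B, T))
    print(f"reverse-KL distillation loss: {loss.item():.4f}")
```

Per the abstract, Uni-OPD's contributions sit on top of this loop: the student-side balancing decides which rollout states this loss actually sees, and the teacher-side calibration keeps the aggregated supervision consistent with the outcome reward.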