MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

arXiv cs.CL / 5/5/2026

📰 News · Models & Research

Key Points

  • The paper identifies two key limitations of on-policy distillation (OPD): a single-teacher capability ceiling and instability from per-step errors that compound in agentic, long-horizon tasks.
  • It proposes MAD-OPD, which replaces a single distillation teacher with a multi-teacher debate that produces token-level supervision weighted by post-debate confidence.
  • To extend OPD to agentic settings, it introduces On-Policy Agentic Distillation (OPAD), which adds step-level sampling to mitigate multi-step error compounding during training.
  • The authors derive a task-adaptive divergence rule, using Jensen–Shannon divergence for agentic stability and reverse KL for code generation, and validate it theoretically and experimentally.
  • Experiments across six Qwen teacher–student configurations and five agentic/code benchmarks show MAD-OPD ranks first in every configuration; in the 14B+8B→4B setting it improves the agentic average by +2.4% and the code average by +3.7% over the stronger single-teacher OPD baseline.
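The confidence-weighted supervision in the second point can be sketched as a convex combination of the teachers' next-token distributions. A minimal sketch: `debate_target` and its post-debate confidence weights are illustrative assumptions, not the paper's exact aggregation rule.

```python
import numpy as np

def debate_target(teacher_probs, confidences):
    """Combine per-token next-token distributions from several teachers
    into one supervision target, weighting each teacher by a hypothetical
    post-debate confidence score."""
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()                        # normalize confidences to weights
    probs = np.asarray(teacher_probs)      # shape: (num_teachers, vocab_size)
    return w @ probs                       # convex combination over teachers

# Toy example: two teachers over a 3-token vocabulary.
p1 = [0.7, 0.2, 0.1]
p2 = [0.2, 0.6, 0.2]
target = debate_target([p1, p2], confidences=[0.8, 0.2])
# target is itself a valid distribution, dominated by the more confident teacher
```

Because the weights are normalized, the target remains a proper probability distribution and can be used directly as the token-level teacher signal in the OPD loss.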

Abstract

On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per-step errors compound across long trajectories and destabilize training. We propose MAD-OPD (Multi-Agent Debate-driven On-Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on-policy state; the debate produces an emergent collective intelligence that supplies token-level supervision, with each teacher's contribution weighted by its post-debate confidence. To extend OPD to agentic tasks, we also introduce On-Policy Agentic Distillation (OPAD), which adds step-level sampling to stabilize training under multi-step error compounding. We additionally derive a task-adaptive divergence principle, selecting JSD (Jensen-Shannon divergence) for agentic stability and reverse KL (Kullback-Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher-student configurations (Qwen3 and Qwen3.5; 1.7B-14B students, 8B-32B teachers) and five agentic and code benchmarks, MAD-OPD ranks first across all six configurations; on the 14B+8B→4B setting it lifts the agentic average by +2.4% and the code average by +3.7% over the stronger single-teacher OPD.
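The task-adaptive divergence rule can be sketched numerically: reverse KL (mode-seeking) for code, and the bounded, symmetric JSD for agentic stability. The function names and the `task` switch below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) over discrete distributions, with a small epsilon for stability."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def reverse_kl(student, teacher):
    # KL(student || teacher): mode-seeking, selected here for code generation.
    return kl(student, teacher)

def jsd(student, teacher):
    # Jensen-Shannon divergence: symmetric and bounded by log 2,
    # selected here for stability on agentic tasks.
    m = 0.5 * (np.asarray(student, dtype=float) + np.asarray(teacher, dtype=float))
    return 0.5 * kl(student, m) + 0.5 * kl(teacher, m)

def distill_loss(student, teacher, task):
    # Hypothetical dispatch following the paper's task-adaptive divergence rule.
    return jsd(student, teacher) if task == "agentic" else reverse_kl(student, teacher)
```

Boundedness is the key property: JSD never exceeds log 2 even when the student and teacher disagree completely, which limits how large any single-step gradient can get over a long trajectory, whereas reverse KL can grow without bound and so concentrates the student on the teacher's modes.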