MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

arXiv cs.CL / 5/5/2026

📰 News · Models & Research

Key Points

  • The paper identifies two key limitations of on-policy distillation (OPD): a single-teacher capability ceiling and instability from per-step errors that compound in agentic, long-horizon tasks.
  • It proposes MAD-OPD, which replaces a single distillation teacher with a multi-teacher debate that produces token-level supervision weighted by post-debate confidence.
  • To extend OPD to agentic settings, it introduces On-Policy Agentic Distillation (OPAD), which adds step-level sampling to mitigate multi-step error compounding during training.
  • The authors derive a task-adaptive divergence rule, using Jensen–Shannon divergence for agentic stability and reverse KL for code generation, and validate it theoretically and experimentally.
  • Experiments across six Qwen teacher–student configurations and five agentic/code benchmarks show MAD-OPD ranks first in every configuration; in the 14B+8B→4B setting it improves the agentic average by +2.4% and the code average by +3.7% over the stronger single-teacher OPD baseline.
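The confidence-weighted supervision in the second point can be sketched as a convex combination of the teachers' next-token distributions. A minimal sketch: `debate_target` and its post-debate confidence weights are illustrative assumptions, not the paper's exact aggregation rule.

```python
import numpy as np

def debate_target(teacher_probs, confidences):
    """Combine per-token next-token distributions from several teachers
    into one supervision target, weighting each teacher by a hypothetical
    post-debate confidence score."""
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()                        # normalize confidences to weights
    probs = np.asarray(teacher_probs)      # shape: (num_teachers, vocab_size)
    return w @ probs                       # convex combination over teachers

# Toy example: two teachers over a 3-token vocabulary.
p1 = [0.7, 0.2, 0.1]
p2 = [0.2, 0.6, 0.2]
target = debate_target([p1, p2], confidences=[0.8, 0.2])
# target is itself a valid distribution, dominated by the more confident teacher
```

Because the weights are normalized, the target remains a proper probability distribution and can be used directly as the token-level teacher signal in the OPD loss.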

Abstract

On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per-step errors compound across long trajectories and destabilize training. We propose MAD-OPD (Multi-Agent Debate-driven On-Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on-policy state; the debate produces an emergent collective intelligence that supplies token-level supervision, with each teacher's contribution weighted by its post-debate confidence. To extend OPD to agentic tasks, we also introduce On-Policy Agentic Distillation (OPAD), which adds step-level sampling to stabilize training under multi-step error compounding. We additionally derive a task-adaptive divergence principle, selecting JSD (Jensen-Shannon divergence) for agentic stability and reverse KL (Kullback-Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher-student configurations (Qwen3 and Qwen3.5; 1.7B-14B students, 8B-32B teachers) and five agentic and code benchmarks, MAD-OPD ranks first across all six configurations; on the 14B+8B→4B setting it lifts the agentic average by +2.4% and the code average by +3.7% over the stronger single-teacher OPD.
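The task-adaptive divergence rule can be sketched numerically: reverse KL (mode-seeking) for code, and the bounded, symmetric JSD for agentic stability. The function names and the `task` switch below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) over discrete distributions, with a small epsilon for stability."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def reverse_kl(student, teacher):
    # KL(student || teacher): mode-seeking, selected here for code generation.
    return kl(student, teacher)

def jsd(student, teacher):
    # Jensen-Shannon divergence: symmetric and bounded by log 2,
    # selected here for stability on agentic tasks.
    m = 0.5 * (np.asarray(student, dtype=float) + np.asarray(teacher, dtype=float))
    return 0.5 * kl(student, m) + 0.5 * kl(teacher, m)

def distill_loss(student, teacher, task):
    # Hypothetical dispatch following the paper's task-adaptive divergence rule.
    return jsd(student, teacher) if task == "agentic" else reverse_kl(student, teacher)
```

Boundedness is the key property: JSD never exceeds log 2 even when the student and teacher disagree completely, which limits how large any single-step gradient can get over a long trajectory, whereas reverse KL can grow without bound and so concentrates the student on the teacher's modes.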