Co-Evolving Policy Distillation
arXiv cs.LG / 5/1/2026
Key Points
- The paper unifies RLVR and OPD as post-training paradigms for consolidating multiple expert capabilities into a single model, and contrasts their failure modes: mixed-data RLVR can incur inter-capability divergence, while the train-experts-then-distill route (expert-then-OPD) can fail to transfer teacher capabilities when the behavior-pattern gap between teacher and student is large.
- It proposes Co-Evolving Policy Distillation (CoPD), which trains the experts in parallel and injects OPD into each expert’s ongoing RLVR training, with the experts acting as mutual teachers for bidirectional OPD (a minimal sketch follows this list).
- By co-evolving the experts with bidirectional OPD, CoPD keeps their behavioral patterns consistent with one another while still preserving their complementary knowledge.
- Experiments show CoPD achieves strong “all-in-one” integration across text, image, and video reasoning, significantly outperforming baselines such as mixed RLVR and MOPD and even surpassing the individual domain-specific experts.
- The authors argue that CoPD’s model-parallel training setup may point to a new scaling paradigm for training future systems.
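
To make the co-evolution idea concrete, here is a minimal PyTorch sketch of one CoPD-style update, under stated assumptions: the function name `copd_step`, the weighting `opd_weight`, and the placeholder `rlvr_loss_fn` are all hypothetical, and the bidirectional OPD term is approximated as a KL toward the other experts’ detached policies on a shared prompt batch; the paper’s actual objective (e.g. distilling on each student’s own rollouts) may differ.

```python
# Hypothetical sketch of CoPD's co-evolving update; names and the exact
# OPD loss are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def copd_step(experts, optimizers, domain_batches, shared_batch,
              rlvr_loss_fn, opd_weight=0.5):
    """One co-evolving update: each expert combines (a) its own RLVR loss
    on its domain batch with (b) a bidirectional on-policy distillation
    term, a KL toward the other experts' current policies on a shared batch."""
    # Snapshot every expert's current logits on the shared batch;
    # teachers are detached so gradients flow only into the student.
    with torch.no_grad():
        teacher_logits = [expert(shared_batch) for expert in experts]

    for i, (expert, opt) in enumerate(zip(experts, optimizers)):
        # (a) Domain-specific RLVR objective (e.g. a PPO/GRPO surrogate
        # driven by a verifiable reward) -- abstracted away here.
        rl_loss = rlvr_loss_fn(expert, domain_batches[i])

        # (b) Bidirectional OPD: distill from every *other* expert,
        # averaged so the weight is independent of the expert count.
        student_logp = F.log_softmax(expert(shared_batch), dim=-1)
        kl = sum(
            F.kl_div(student_logp, F.softmax(t, dim=-1),
                     reduction="batchmean")
            for j, t in enumerate(teacher_logits) if j != i
        ) / max(len(experts) - 1, 1)

        loss = rl_loss + opd_weight * kl
        opt.zero_grad()
        loss.backward()
        opt.step()

# Toy usage: three linear "experts" over a 16-dim input, 8-way action space.
experts = [nn.Linear(16, 8) for _ in range(3)]
optimizers = [torch.optim.Adam(e.parameters(), lr=1e-3) for e in experts]
dummy_rlvr = lambda expert, batch: expert(batch).logsumexp(-1).mean()  # placeholder loss
domain_batches = [torch.randn(4, 16) for _ in range(3)]
shared_batch = torch.randn(4, 16)
copd_step(experts, optimizers, domain_batches, shared_batch, dummy_rlvr)
```

Because every expert is simultaneously a student and a (detached) teacher within the same loop, the distillation pressure is applied while the policies are still close, which is the intuition behind avoiding the large behavior-pattern gaps of expert-then-OPD.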