Co-Evolving Policy Distillation

arXiv cs.LG / 5/1/2026


Key Points

  • The paper gives a unified analysis of RLVR (reinforcement learning with verifiable rewards) and OPD (on-policy distillation) as post-training paradigms for consolidating multiple expert capabilities into a single model, and identifies distinct failure modes: mixed RLVR incurs inter-capability divergence, while the expert-then-OPD pipeline misses teacher capabilities due to large behavior-pattern gaps between teacher and student.
  • It proposes Co-Evolving Policy Distillation (CoPD), which trains experts in parallel and injects OPD during each expert’s ongoing RLVR training, with experts acting as mutual teachers for bidirectional OPD.
  • By co-evolving experts with bidirectional OPD, CoPD produces more consistent behavioral patterns across experts while preserving enough complementary knowledge.
  • Experiments show CoPD achieves strong “all-in-one” integration across text, image, and video reasoning and significantly outperforms baselines like mixed RLVR and MOPD, even beating domain-specific experts.
  • The authors argue CoPD’s model-parallel training setup could suggest a new training scaling paradigm for future systems.

Abstract

RLVR and OPD have become standard paradigms for post-training. We provide a unified analysis of these two paradigms for consolidating multiple expert capabilities into a single model, identifying how each loses capability: mixed RLVR suffers from an inter-capability divergence cost, while the pipeline of first training experts and then performing OPD, though it avoids divergence, fails to fully absorb teacher capabilities due to large behavioral-pattern gaps between teacher and student. We propose Co-Evolving Policy Distillation (CoPD), which trains experts in parallel and introduces OPD during each expert's ongoing RLVR training rather than after expert training completes, with experts serving as mutual teachers (making OPD bidirectional) so that they co-evolve. This yields more consistent behavioral patterns among experts while maintaining sufficient complementary knowledge throughout training. Experiments validate that CoPD achieves all-in-one integration of text, image, and video reasoning capabilities, significantly outperforming strong baselines such as mixed RLVR and MOPD, and even surpassing domain-specific experts. The model-parallel training pattern offered by CoPD may inspire a novel training-scaling paradigm.
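To make the described objective concrete, here is a minimal toy sketch of the core idea: each co-evolving expert optimizes its own RLVR loss plus a distillation term pulling its output distribution toward the other experts' distributions (the bidirectional OPD component). The KL form of the distillation term, the uniform averaging over peer experts, and the weighting coefficient `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits):
    """Convert per-token logits to a probability distribution."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL(p || q), averaged over the batch dimension."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def copd_losses(expert_logits, rlvr_losses, lam=0.5):
    """Illustrative per-expert CoPD objective: the expert's own RLVR
    loss plus a distillation penalty toward every peer expert's policy.
    Because every expert distills from every other, the OPD signal
    flows in both directions (mutual teachers)."""
    policies = [softmax(l) for l in expert_logits]
    losses = []
    for i, p in enumerate(policies):
        # Average KL to all peer experts' output distributions.
        distill = np.mean([kl(p, q) for j, q in enumerate(policies) if j != i])
        losses.append(rlvr_losses[i] + lam * distill)
    return losses

# Toy example: 3 experts, batch of 4 prompts, vocabulary of 8 tokens.
rng = np.random.default_rng(0)
logits = [rng.normal(size=(4, 8)) for _ in range(3)]
print(copd_losses(logits, rlvr_losses=[0.9, 1.1, 1.0]))
```

As the experts' behavioral patterns converge, the distillation term shrinks toward zero and each expert's loss reduces to its own RLVR loss, which matches the paper's intuition that co-evolution keeps behavior consistent while RLVR preserves each expert's complementary knowledge.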