Counterfactual Credit Policy Optimization for Multi-Agent Collaboration
arXiv cs.AI / 2026-03-24
Key Points
- The paper introduces Counterfactual Credit Policy Optimization (CCPO) to address reinforcement learning credit assignment problems in collaborative multi-agent LLM systems with shared rewards.
- CCPO estimates each agent’s marginal contribution by comparing the shared reward against counterfactual trajectories in which that agent’s contribution is removed, yielding agent-specific learning signals that reduce update variance and discourage free-riding.
- It further improves stability across heterogeneous tasks and data distributions via a global-history-aware advantage normalization calibrated with global rollout statistics.
- Experiments on sequential Think–Reason and multi-agent voting collaboration topologies show CCPO mitigates free-riding and outperforms strong multi-agent RL baselines on mathematical and logical reasoning benchmarks.
- The authors provide an implementation at the linked GitHub repository for applying the CCPO framework in collaborative LLM training.
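The credit-assignment idea in the points above can be sketched in code: an agent’s advantage is its marginal contribution (shared reward minus the reward of a counterfactual rollout with that agent removed), normalized against running statistics over recent rollouts. This is a minimal illustrative sketch, not the paper’s implementation; the class and function names (`GlobalStats`, `counterfactual_advantage`) and all numeric details are hypothetical.

```python
import math
from collections import deque


class GlobalStats:
    """Running mean/std over recent rollout signals.

    A stand-in for the paper's 'global rollout statistics'; the windowed
    buffer is an assumption of this sketch."""

    def __init__(self, maxlen=1000):
        self.buf = deque(maxlen=maxlen)

    def update(self, x):
        self.buf.append(x)

    def normalize(self, x):
        n = len(self.buf)
        if n < 2:
            # Not enough history to estimate variance; pass through.
            return x
        mean = sum(self.buf) / n
        var = sum((v - mean) ** 2 for v in self.buf) / (n - 1)
        return (x - mean) / (math.sqrt(var) + 1e-8)


def counterfactual_advantage(shared_reward, counterfactual_reward, stats):
    """Agent-specific advantage: the shared reward minus the reward of a
    counterfactual rollout with this agent's contribution removed,
    normalized with global rollout statistics."""
    marginal = shared_reward - counterfactual_reward
    stats.update(marginal)
    return stats.normalize(marginal)
```

An agent whose removal barely changes the shared reward (a free-rider) gets a near-zero marginal term, so its policy update is suppressed rather than inflated by the group’s success.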
