Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

arXiv cs.AI / 2026/3/24


Key Points

  • The paper introduces Counterfactual Credit Policy Optimization (CCPO) to address reinforcement learning credit assignment problems in collaborative multi-agent LLM systems with shared rewards.
  • CCPO estimates each agent’s marginal contribution using counterfactual trajectories that remove an agent’s contribution, producing agent-specific learning signals and reducing update variance and free-riding.
  • It further improves stability across heterogeneous tasks and data distributions via a history-aware advantage normalization scheme calibrated with global rollout statistics.
  • Experiments on sequential Think–Reason and multi-agent voting collaboration topologies show CCPO mitigates free-riding and outperforms strong multi-agent RL baselines on mathematical and logical reasoning benchmarks.
  • The authors provide an implementation at the linked GitHub repository for applying the CCPO framework in collaborative LLM training.
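The core idea in the second bullet can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and it assumes each agent's counterfactual baseline is simply the mean return over rollouts in which that agent's contribution has been ablated.

```python
# Hypothetical sketch of CCPO-style counterfactual credit assignment.
# Assumption (not from the paper): the counterfactual baseline for an
# agent is the mean return of rollouts with that agent's output removed
# or replaced; the agent's advantage is the team return minus it.

def counterfactual_advantages(team_return, counterfactual_returns):
    """team_return: scalar return of the full collaborative rollout.
    counterfactual_returns: dict mapping each agent to a list of returns
    from rollouts with that agent's contribution ablated."""
    advantages = {}
    for agent, cf_returns in counterfactual_returns.items():
        baseline = sum(cf_returns) / len(cf_returns)  # counterfactual baseline
        advantages[agent] = team_return - baseline    # marginal contribution
    return advantages
```

An agent whose removal barely changes the outcome (a free-rider) ends up with an advantage near zero, so the shared reward no longer uniformly reinforces every agent.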

Abstract

Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent-specific learning signals by estimating each agent's marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent's contribution removed, yielding role-sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global-history-aware normalization scheme that calibrates advantages using global rollout statistics. We evaluate CCPO on two collaboration topologies: a sequential Think--Reason dyad and multi-agent voting. Across mathematical and logical reasoning benchmarks, CCPO mitigates free-riding and outperforms strong multi-agent RL baselines, yielding finer-grained and more effective credit assignment for collaborative LLM training. Our code is available at https://github.com/bhai114/ccpo.
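The global-history-aware normalization the abstract mentions could look roughly like the sketch below. The specifics are an assumption for illustration: here "global rollout statistics" are taken to be a running mean and variance over all historical rollouts (maintained with Welford's online algorithm), which may differ from the authors' exact calibration.

```python
# Hedged sketch of a global-history-aware advantage normalizer.
# Assumption: advantages are standardized against running statistics
# accumulated over the entire rollout history, rather than per-batch.

class GlobalAdvantageNormalizer:
    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, advantages):
        # Welford's online mean/variance over the global rollout history.
        for a in advantages:
            self.count += 1
            delta = a - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (a - self.mean)

    def normalize(self, advantages):
        var = self.m2 / max(self.count - 1, 1)
        std = (var + self.eps) ** 0.5
        return [(a - self.mean) / std for a in advantages]
```

Normalizing against history rather than the current batch keeps the advantage scale stable when batches mix heterogeneous tasks whose raw rewards differ widely.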