CoFi-PGMA: Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs

arXiv cs.LG / April 28, 2026


Key Points

  • The paper studies RLHF-style learning for multi-agent LLM systems where each agent’s training signal is “filtered” by the system’s routing or collaboration mechanism.
  • It argues that standard single-policy RLHF objectives become misspecified under selection-gated feedback (routing) and shared rewards that hide individual contributions (collaboration).
  • The authors propose CoFi-PGMA, a unified framework that builds a counterfactual per-agent objective using marginal contribution to correct the learning signal in both settings.
  • For routing, the objective yields off-policy corrections for selection-gated feedback, while for collaboration it becomes leave-one-out difference rewards for credit assignment (illustrated in the sketch after this list).
  • The work also analyzes how softmax routing creates risk-sensitive incentives, and it provides practical multiturn-aware training algorithms validated on a real-world reasoning dataset.
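
To make the two corrections concrete, the sketch below shows one way they might look in code. It is an illustration based only on the summary above, not the paper's implementation: the function names, the softmax router, and the toy reward are all assumptions.

```python
# Illustrative sketch only -- not the paper's implementation. Function
# names, the softmax router, and the toy reward below are assumptions.
import numpy as np


def leave_one_out_credits(team_reward, responses):
    """Collaboration: difference rewards.

    Agent i's credit is R(all responses) - R(responses without i),
    i.e. its marginal contribution to the shared outcome.
    """
    full = team_reward(responses)
    return [full - team_reward(responses[:i] + responses[i + 1:])
            for i in range(len(responses))]


def selection_weight(router_scores, selected_idx, temperature=1.0):
    """Routing: selection-gated feedback.

    Only the selected agent's response is scored, so its observed reward
    is reweighted by the inverse softmax selection probability (an
    importance-sampling-style off-policy correction).
    """
    scores = np.asarray(router_scores, dtype=float) / temperature
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return 1.0 / probs[selected_idx], probs


if __name__ == "__main__":
    # Toy shared reward: how many responses contain the correct answer "42".
    reward = lambda rs: float(sum("42" in r for r in rs))
    print(leave_one_out_credits(reward, ["the answer is 42", "maybe 7", "42"]))
    # -> [1.0, 0.0, 1.0]: only the agents that contributed "42" get credit.

    w, p = selection_weight([2.0, 1.0, 0.5], selected_idx=0)
    print(round(w, 3), np.round(p, 3))
```

In a full RLHF-style loop, per-agent credits or selection weights like these would multiply each agent's policy-gradient term in place of the raw shared or gated reward.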

Abstract

Large language model (LLM) deployments increasingly rely on multi-agent architectures in which multiple models either compete through routing mechanisms or collaborate to produce a final answer. In both settings, the learning signal received by each agent is filtered by the system mechanism. Routing produces selection-gated feedback where only the chosen response is evaluated, while collaboration produces shared rewards that obscure the individual contribution of each agent. As a result, standard RLHF objectives designed for a single deployed policy become misspecified. We introduce CoFi-PGMA (Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs), a unified framework for learning under filtered feedback in multi-agent LLM systems. Our approach derives a counterfactual per-agent training objective based on marginal contribution, which corrects the learning signal under both routing and collaborative mechanisms. For routing systems, the objective corresponds to off-policy corrections for selection-gated feedback, while for collaborative systems it reduces to leave-one-out difference rewards for credit assignment. We further analyze how softmax routing induces risk-sensitive incentives and provide practical training algorithms that integrate counterfactual estimators, multiturn-aware rewards, and policy optimization methods, and demonstrate the approach on a real-world reasoning dataset.
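
For orientation, the abstract's two special cases can be written schematically as per-agent policy gradients. The notation below is ours, not the paper's, and is only meant to show where the marginal-contribution term and the selection-probability correction enter.

```latex
% Schematic only; notation assumed, not taken from the paper.
% Collaboration: leave-one-out (difference) reward = marginal contribution.
\nabla_{\theta_i} J_i \approx
  \mathbb{E}\Big[\big(R(a_1,\dots,a_n) - R(a_{-i})\big)\,
  \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid x)\Big]

% Routing: reward observed only when agent i is selected; the inverse
% softmax selection probability corrects the selection-gated feedback.
\nabla_{\theta_i} J_i \approx
  \mathbb{E}\Big[\frac{\mathbf{1}\{\,i \text{ selected}\,\}}{p(i \mid x)}\,
  r(a_i)\, \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid x)\Big]
```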
