CoFi-PGMA: Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs
arXiv cs.LG, April 28, 2026
Key Points
- The paper studies RLHF-style learning for multi-agent LLM systems where each agent’s training signal is “filtered” by the system’s routing or collaboration mechanism.
- It argues that standard single-policy RLHF objectives become misspecified under selection-gated feedback (routing) and shared rewards that hide individual contributions (collaboration).
- The authors propose CoFi-PGMA, a unified framework that builds a counterfactual per-agent objective from each agent's marginal contribution, correcting the learning signal in both settings (see the math sketch after this list).
- For routing, the objective yields off-policy corrections for selection-gated feedback; for collaboration, it reduces to leave-one-out difference rewards for credit assignment (both sketched in code below).
- The paper also analyzes how softmax routing creates risk-sensitive incentives (a toy illustration follows the list) and provides practical multi-turn-aware training algorithms, validated on a real-world reasoning dataset.
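
The digest does not reproduce the paper's equations, but the marginal-contribution construction it describes is conventionally written as a leave-one-out difference reward. A minimal sketch, assuming a joint action $a = (a_1, \dots, a_n)$, a shared reward $R$, and a fixed default contribution $c_i$ (notation ours, not the paper's):

```latex
% Conventional leave-one-out difference reward for agent i; the paper's
% exact objective may differ. c_i is a fixed default (counterfactual)
% contribution substituted for agent i's output.
\[
  D_i(a) = R(a_1,\dots,a_i,\dots,a_n) - R(a_1,\dots,c_i,\dots,a_n)
\]
% Per-agent policy gradient driven by the corrected credit signal:
\[
  \nabla_{\theta_i} J_i
    = \mathbb{E}_{a \sim \pi_\theta}\!\left[ D_i(a)\,
        \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid s) \right]
\]
```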
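
In code, the two corrections the summary names can be sketched as follows. This is a hypothetical illustration under our own assumptions, not the paper's implementation: `shared_reward`, `defaults`, and `gated_pg_loss` are names we introduce, and the inverse-propensity weight is the standard off-policy correction for feedback observed only when an agent is selected.

```python
from typing import Callable, List

import torch


def difference_rewards(
    shared_reward: Callable[[List[str]], float],
    contributions: List[str],
    defaults: List[str],
) -> List[float]:
    """Leave-one-out credit assignment: each agent is credited with the drop
    in team reward when its contribution is swapped for a default one."""
    r_team = shared_reward(contributions)
    credits = []
    for i in range(len(contributions)):
        counterfactual = list(contributions)
        counterfactual[i] = defaults[i]  # replace agent i's output only
        credits.append(r_team - shared_reward(counterfactual))
    return credits


def gated_pg_loss(
    logp: torch.Tensor,         # log pi_i(a_i | s) for this agent's outputs
    reward: torch.Tensor,       # reward observed when the agent was selected
    select_prob: torch.Tensor,  # router's probability of selecting this agent
    selected: torch.Tensor,     # 1 if the router actually selected the agent
) -> torch.Tensor:
    """REINFORCE-style loss with an inverse-propensity weight, so that
    feedback gated by the router's selection stays unbiased in expectation."""
    w = selected.float() / select_prob.clamp_min(1e-6)
    return -(w.detach() * reward * logp).mean()
```

In a collaboration setting, the credits from `difference_rewards` would then play the role of the reward term in each agent's policy-gradient update.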
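
Why selection gating distorts incentives can be seen with a toy example (ours, not the paper's): under softmax routing, an agent's expected learning signal is scaled by its selection probability, so feedback concentrates on agents the router already favors.

```python
import torch

scores = torch.tensor([2.0, 1.0, 0.0])   # router scores for three agents
p_select = torch.softmax(scores, dim=0)  # ~[0.67, 0.24, 0.09]
reward = torch.tensor([0.5, 0.9, 0.9])   # each agent's true expected reward
print(p_select * reward)                 # gated signal: ~[0.33, 0.22, 0.08]
# Agent 0 receives the strongest gradient signal despite the lowest reward,
# the kind of bias the off-policy correction above is meant to undo.
```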