Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning
arXiv cs.CL / 4/3/2026
Key Points
- The paper introduces Agent Q-Mix, a reinforcement learning framework that learns how to select and connect agents in LLM multi-agent systems by treating topology selection as a cooperative multi-agent RL (MARL) problem.
- Communication decisions are decentralized: each round, every agent picks a communication action, and these choices jointly form a round-wise communication graph; QMIX value factorization then combines the per-agent utilities into a joint action-value for training (see the graph-construction sketch after this list).
- The architecture combines a topology-aware GNN encoder, GRU-based memory, and per-agent Q-heads within a CTDE (centralized training, decentralized execution) setup, sketched below.
- Agent Q-Mix optimizes a reward that trades off task accuracy against token cost, aiming for both performance and efficiency (an illustrative reward shape follows the architecture sketch).
- Across seven coding, reasoning, and math benchmarks, including Humanity's Last Exam (HLE), the method reports higher average accuracy and better token efficiency and robustness than prior approaches, including a reported 20.8% HLE accuracy with Gemini-3.1-Flash-Lite.
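To make the round-wise communication graph concrete, here is a minimal sketch of how per-agent discrete actions could be assembled into an adjacency matrix. The action encoding (each agent either names one peer to message or stays silent) is an assumption for illustration, not the paper's action space, and `build_comm_graph` is a hypothetical helper.

```python
import numpy as np

def build_comm_graph(actions: np.ndarray, n_agents: int) -> np.ndarray:
    """Turn per-agent discrete choices into one round's communication graph.
    Assumed encoding (illustrative, not from the paper): action i in
    {0..n_agents-1} means "send my message to agent i"; action n_agents
    means "stay silent". Returns A with A[i, j] = 1 iff i talks to j."""
    A = np.zeros((n_agents, n_agents), dtype=int)
    for i, a in enumerate(actions):
        if a < n_agents and a != i:  # ignore silence and self-loops
            A[i, a] = 1
    return A

# Example: agent 0 -> 2, agent 1 silent, agent 2 -> 0, agent 3 -> 1
print(build_comm_graph(np.array([2, 4, 0, 1]), n_agents=4))
```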
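Below is a minimal PyTorch sketch of the QMIX-style CTDE setup the bullets describe: per-agent networks with GRU memory produce individual Q-values over communication actions, and a monotonic mixing network combines them into a joint value for centralized training. The paper's encoder is a topology-aware GNN; this sketch swaps in a plain linear layer, and all class names and hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentQNet(nn.Module):
    """Per-agent utility network: observation -> GRU memory -> Q-values
    over communication actions. The paper encodes observations with a
    topology-aware GNN; a linear layer stands in for it here."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.enc = nn.Linear(obs_dim, hidden)
        self.gru = nn.GRUCell(hidden, hidden)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, obs, h):
        x = F.relu(self.enc(obs))
        h = self.gru(x, h)  # recurrent memory carried across rounds
        return self.q_head(h), h

class QMixer(nn.Module):
    """Monotonic mixing network: combines per-agent Q-values into Q_tot
    using state-conditioned non-negative weights, so dQ_tot/dQ_i >= 0
    (the standard QMIX constraint enabling decentralized greedy action)."""
    def __init__(self, n_agents: int, state_dim: int, embed: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.w1 = nn.Linear(state_dim, n_agents * embed)
        self.b1 = nn.Linear(state_dim, embed)
        self.w2 = nn.Linear(state_dim, embed)
        self.b2 = nn.Sequential(
            nn.Linear(state_dim, embed), nn.ReLU(), nn.Linear(embed, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.w1(state)).view(b, self.n_agents, -1)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1)
                       + self.b1(state).unsqueeze(1))
        w2 = torch.abs(self.w2(state)).view(b, -1, 1)
        return (torch.bmm(hidden, w2)
                + self.b2(state).unsqueeze(1)).squeeze(-1).squeeze(-1)

# Toy forward pass: 3 agents, batch of 2.
n_agents, obs_dim, n_actions, state_dim, hidden = 3, 8, 4, 16, 64
agents = [AgentQNet(obs_dim, n_actions, hidden) for _ in range(n_agents)]
mixer = QMixer(n_agents, state_dim)
obs = torch.randn(2, n_agents, obs_dim)
h = [torch.zeros(2, hidden) for _ in range(n_agents)]
per_agent_q = []
for i, net in enumerate(agents):
    q, h[i] = net(obs[:, i], h[i])
    per_agent_q.append(q.max(dim=1).values)  # greedy utility per agent
q_tot = mixer(torch.stack(per_agent_q, dim=1), torch.randn(2, state_dim))
print(q_tot.shape)  # torch.Size([2])
```

The monotonicity constraint (enforced here via `torch.abs` on the mixing weights) is what lets each agent act greedily on its own Q-head at execution time while training remains centralized.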
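Finally, the accuracy-versus-token-cost trade-off could be shaped roughly as follows. The functional form, `lam`, and `token_budget` are illustrative guesses, not values from the paper.

```python
def accuracy_cost_reward(task_correct: bool, tokens_used: int,
                         token_budget: int = 8192, lam: float = 0.5) -> float:
    """Hypothetical shaping of an accuracy-vs-cost trade-off: full credit
    for a correct answer, minus a penalty proportional to the fraction of
    the token budget consumed. All constants here are illustrative."""
    accuracy_term = 1.0 if task_correct else 0.0
    cost_term = lam * min(tokens_used / token_budget, 1.0)
    return accuracy_term - cost_term

print(accuracy_cost_reward(True, 2048))   # 0.875: correct, cheap
print(accuracy_cost_reward(True, 8192))   # 0.5:   correct, expensive
print(accuracy_cost_reward(False, 2048))  # -0.125: wrong and still cost tokens
```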