Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows

arXiv cs.CL · April 23, 2026


Key Points

  • The study benchmarks 35 open-weight LLMs by having them participate in six behavioral-economics games that test different cooperation mechanisms under shared constraints.
  • It finds that “cooperative profiles” derived from these game behaviors strongly and robustly predict how well LLM agents perform together on AI-for-Science workflows.
  • In particular, models that coordinate effectively in the games and favor multiplicative (team-amplifying) investment over greedy strategies produce better scientific reports.
  • The predictive relationship remains after controlling for multiple factors, suggesting that cooperative disposition is a measurable, distinct property of LLMs rather than just general capability.
  • The authors propose a behavioral-games framework as a fast, low-cost diagnostic tool to screen for cooperative “fitness” before deploying expensive multi-agent systems.
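The screening idea in the last point can be illustrated with a toy example. The sketch below uses a standard linear public goods game (a common behavioral-economics cooperation game; the paper's actual six games, agents, and scoring are not specified here), showing why multiplicative team production beats greedy free-riding and how a simple "cooperative profile" score could be read off agent behavior:

```python
def public_goods_round(contributions, endowment=10.0, multiplier=1.6):
    """One round of a linear public goods game.

    Each agent keeps (endowment - contribution) and receives an equal
    share of the multiplied common pot. With multiplier > 1, full
    contribution maximizes total team payoff, even though free-riding
    is individually tempting.
    """
    n = len(contributions)
    pot = multiplier * sum(contributions)
    return [endowment - c + pot / n for c in contributions]

def cooperative_profile(contributions, endowment=10.0):
    """Mean fraction of endowment contributed -- a toy stand-in for a
    game-derived cooperation score (hypothetical metric, not the
    paper's)."""
    return sum(c / endowment for c in contributions) / len(contributions)

# A fully cooperative team vs. a greedy (zero-contribution) team of 4 agents.
coop_payoffs = public_goods_round([10.0] * 4)    # each: 1.6 * 40 / 4 = 16.0
greedy_payoffs = public_goods_round([0.0] * 4)   # each keeps its 10.0

assert sum(coop_payoffs) > sum(greedy_payoffs)
print(cooperative_profile([10.0] * 4), sum(coop_payoffs))    # 1.0 64.0
print(cooperative_profile([0.0] * 4), sum(greedy_payoffs))   # 0.0 40.0
```

In this setup the cooperative team's total payoff (64) exceeds the greedy team's (40), mirroring the paper's finding that agents investing in multiplicative team production outperform greedy strategies; a score like `cooperative_profile` could then be computed cheaply before any expensive multi-agent deployment.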

Abstract

Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that coordinate effectively in games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes: accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral-games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.