The Price of Paranoia: Robust Risk-Sensitive Cooperation in Non-Stationary Multi-Agent Reinforcement Learning

arXiv cs.AI / April 20, 2026

💬 Opinion · Models & Research

Key Points

  • Cooperative equilibria in multi-agent reinforcement learning can become unstable because agents co-learn: each agent’s gradient updates change its partner’s action distribution, adding noise exactly where cooperation is most sensitive.
  • The paper shows that even strongly Pareto-dominant cooperative equilibria are exponentially unstable under standard risk-neutral learning, and collapse irreversibly once partner-induced noise exceeds a critical threshold.
  • Applying “distributional robustness” in a naive way (e.g., making agents risk-averse over return distributions) can worsen instability because it penalizes high-variance cooperative actions more than defection.
  • The authors instead aim robustness at the policy-gradient update variance caused by partner uncertainty, using an online measure of partner unpredictability to modulate gradient steps and expand the cooperation basin (see the sketch after this list).
  • They introduce two metrics, the “Price of Paranoia” and the “Cooperation Window,” which jointly characterize stability, sample efficiency, and welfare recovery, and they derive the optimal robustness level as a closed-form trade-off.
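
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it describes: estimate partner unpredictability online and shrink the policy-gradient step as that estimate grows. The entropy estimator, the `1/(1 + lam * u)` schedule, and all names (`partner_unpredictability`, `modulated_gradient_step`, `lam`) are illustrative assumptions, not the authors' method.

```python
import numpy as np

def partner_unpredictability(action_history, window=50):
    """Online unpredictability proxy: empirical entropy (in nats) of the
    partner's recent discrete actions. Hypothetical estimator; the paper
    may use a different measure."""
    recent = np.asarray(action_history[-window:])
    if recent.size == 0:
        return 0.0  # no observations yet: treat the partner as predictable
    counts = np.bincount(recent, minlength=2).astype(float)
    probs = counts / counts.sum()
    probs = probs[probs > 0]
    return -(probs * np.log(probs)).sum()  # 0.0 = fully predictable partner

def modulated_gradient_step(theta, grad, partner_actions, lr=0.1, lam=2.0):
    """Shrink the policy-gradient step as the partner becomes less
    predictable, damping the update variance that partner noise injects.
    `lam` trades equilibrium stability against sample efficiency."""
    u = partner_unpredictability(partner_actions)
    return theta + (lr / (1.0 + lam * u)) * grad

# Toy usage: a maximally noisy partner history (entropy = ln 2) cuts the
# effective step size by more than half relative to lr.
theta = np.zeros(3)
grad = np.ones(3)
theta = modulated_gradient_step(theta, grad, partner_actions=[0, 1, 1, 0, 1, 0])
```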

Abstract

Cooperative equilibria are fragile. When agents learn alongside each other rather than in a fixed environment, the process of learning destabilizes the cooperation they are trying to sustain: every gradient step an agent takes shifts the distribution of actions its partner will play, turning a cooperative partner into a source of stochastic noise precisely where the cooperation decision is most sensitive. We study how this co-learning noise propagates through the structure of coordination games, and find that the cooperative equilibrium, even when strongly Pareto-dominant, is exponentially unstable under standard risk-neutral learning, collapsing irreversibly once partner noise crosses the game's critical cooperation threshold. The natural response, applying distributional robustness to hedge against partner uncertainty, makes things strictly worse: risk-averse return objectives penalize the high-variance cooperative action relative to defection, widening the instability region rather than shrinking it. This paradox reveals a fundamental mismatch between the domain where robustness is applied (returns) and the domain where the instability originates (learning updates). We resolve it by showing that robustness should target the policy-gradient update variance induced by partner uncertainty, not the return distribution. This distinction yields an algorithm whose gradient updates are modulated by an online measure of partner unpredictability, provably expanding the cooperation basin in symmetric coordination games. To unify the stability, sample-complexity, and welfare consequences of this approach, we introduce the Price of Paranoia as the structural dual of the Price of Anarchy. Together with a novel Cooperation Window, it precisely characterizes how much welfare learning algorithms can recover under partner noise, pinning down the optimal degree of robustness as a closed-form balance between equilibrium stability and sample efficiency.
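
To make the collapse mechanism concrete, here is an entirely illustrative simulation (not from the paper) of two independent risk-neutral REINFORCE learners in a symmetric stag hunt. The payoff matrix, the `flip` noise model standing in for partner-induced unpredictability, and all hyperparameters are assumptions chosen to make the threshold visible.

```python
import numpy as np

# Symmetric stag hunt. Action 0 = cooperate (stag), 1 = defect (hare).
# Mutual cooperation (4, 4) Pareto-dominates mutual defection (3.5, 3.5),
# but a lone cooperator earns 0. All numbers are illustrative.
PAYOFF = np.array([[4.0, 0.0],
                   [3.5, 3.5]])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run(flip=0.0, lr=0.2, steps=5000, seed=0):
    """Two independent risk-neutral REINFORCE learners with a running
    average-reward baseline. `flip` is the chance an executed action
    differs from the intended one, a crude stand-in for partner noise."""
    rng = np.random.default_rng(seed)
    theta = np.array([3.0, 3.0])     # both agents start deep in the cooperation basin
    baseline = np.array([3.5, 3.5])  # running reward baselines
    for _ in range(steps):
        p = sigmoid(theta)                               # P(intend to cooperate)
        intend = (rng.random(2) > p).astype(int)         # 0 = cooperate, 1 = defect
        flipped = rng.random(2) < flip
        executed = np.where(flipped, 1 - intend, intend) # noisy execution
        r = np.array([PAYOFF[executed[0], executed[1]],
                      PAYOFF[executed[1], executed[0]]])
        grad_logp = np.where(intend == 0, 1.0 - p, -p)   # d(log pi(intend))/d(theta)
        theta = theta + lr * grad_logp * (r - baseline)
        baseline = baseline + 0.1 * (r - baseline)
    return sigmoid(theta)  # final cooperation probabilities

for flip in (0.0, 0.05, 0.2):
    print(f"noise={flip:.2f} -> final cooperation probs {run(flip=flip)}")
```

With these payoffs, intending to cooperate is self-reinforcing only while the partner's executed cooperation rate stays above 3.5/4 = 0.875, so small noise (flip near 0.05) leaves cooperation intact while larger noise (flip near 0.2) tips both agents irreversibly into mutual defection. The exact threshold depends on the payoff matrix, which is the point of the critical-threshold analysis described above.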