Pessimism-Free Offline Learning in General-Sum Games via KL Regularization

arXiv cs.LG / 5/4/2026


Key Points

  • The paper addresses distribution shift in offline multi-agent reinforcement learning for general-sum games and shows that KL regularization can stabilize learning without needing manually tuned pessimistic penalties.
  • It introduces General-sum Anchored Nash Equilibrium (GANE) to recover regularized Nash equilibria (sketched in the formula after this list) with an accelerated statistical rate of roughly \(\tilde{O}(1/n)\).
  • For practical computation, the authors propose General-sum Anchored Mirror Descent (GAMD), an iterative method that converges to a Coarse Correlated Equilibrium with a standard rate of about \(\tilde{O}(1/\sqrt{n} + 1/T)\).
  • Overall, the work positions KL regularization as a standalone mechanism for “pessimism-free” offline learning in multi-player general-sum settings, matching or improving on the rates of prior pessimism-based approaches.

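As a rough guide to what “anchored” KL regularization typically means in this setting (the notation below is illustrative and not taken verbatim from the paper), each player's utility is penalized by its divergence from a reference policy, and the regularized Nash equilibrium is defined with respect to these penalized utilities:

\[
\pi_i^{\star} \in \arg\max_{\pi_i}\; \mathbb{E}_{\pi_i,\,\pi_{-i}^{\star}}\!\big[r_i\big] \;-\; \beta\,\mathrm{KL}\big(\pi_i \,\|\, \mu_i\big) \quad \text{for every player } i,
\]

where \(\mu_i\) is player \(i\)'s anchor policy (e.g., an estimate of the policy that generated the offline data) and \(\beta > 0\) controls the strength of the regularization.
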
Abstract

Offline multi-agent reinforcement learning in general-sum settings is challenged by the distribution shift between logged datasets and target equilibrium policies. While standard methods rely on manual pessimistic penalties, we demonstrate that KL regularization suffices to stabilize learning and achieve equilibrium recovery. We propose General-sum Anchored Nash Equilibrium (GANE), which recovers regularized Nash equilibria at an accelerated statistical rate of \(\widetilde{O}(1/n)\). For computational tractability, we develop General-sum Anchored Mirror Descent (GAMD), an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of \(\widetilde{O}(1/\sqrt{n}+1/T)\). These results establish KL regularization as a standalone mechanism for pessimism-free offline learning that achieves equivalent or accelerated rates in multi-player general-sum games.
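
To make the "anchored mirror descent" idea concrete, here is a minimal sketch of KL-anchored mirror descent on a toy two-player general-sum matrix game. This is not the paper's GAMD algorithm (which works from offline data estimates rather than known payoff matrices); the function name, payoff matrices, anchors, and hyperparameters `beta`, `eta`, and `T` are all illustrative assumptions.

```python
import numpy as np

def kl_anchored_mirror_descent(A, B, anchor_x, anchor_y, beta=0.1, eta=0.1, T=2000):
    """Illustrative KL-anchored mirror descent for a two-player general-sum
    matrix game (payoff matrix A for player 1, B for player 2).

    Minimal sketch of the general idea of anchoring iterates to a reference
    policy via a KL penalty; not the paper's GAMD algorithm.
    """
    x = anchor_x.copy()  # player 1 mixed strategy, initialized at its anchor
    y = anchor_y.copy()  # player 2 mixed strategy, initialized at its anchor
    for _ in range(T):
        # Expected payoff of each pure action against the opponent's current strategy.
        qx = A @ y
        qy = B.T @ x
        # Gradient of the KL-regularized utility  u_i(pi) - beta * KL(pi_i || mu_i).
        gx = qx - beta * (np.log(x) - np.log(anchor_x))
        gy = qy - beta * (np.log(y) - np.log(anchor_y))
        # Entropic mirror descent step (multiplicative-weights form), then renormalize.
        x = x * np.exp(eta * gx)
        x /= x.sum()
        y = y * np.exp(eta * gy)
        y /= y.sum()
    return x, y

# Toy usage: a 2x2 general-sum game with uniform anchor policies.
A = np.array([[3.0, 0.0], [5.0, 1.0]])
B = np.array([[3.0, 5.0], [0.0, 1.0]])
uniform = np.ones(2) / 2
x, y = kl_anchored_mirror_descent(A, B, uniform, uniform)
print("player 1 strategy:", x)
print("player 2 strategy:", y)
```

The KL penalty keeps each player's iterates close to its anchor, which is the mechanism the paper argues can replace explicit pessimistic penalties when the anchor reflects the offline data distribution.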