Optimal last-iterate convergence in matrix games with bandit feedback using the log-barrier

arXiv cs.LG / 4/17/2026


Key Points

  • The paper studies how to learn minimax policies in zero-sum matrix games under bandit-style feedback, focusing on achieving last-iterate convergence.
  • Prior work (Fiegel et al., 2025) showed that when players are uncoupled, last-iterate convergence is fundamentally harder, with a lower bound of order Ω(t^{-1/4}) on the exploitability gap.
  • The authors propose online mirror descent with log-barrier regularization and a dual-focused analysis, proving a high-probability last-iterate convergence rate of Õ(t^{-1/4}), matching the lower bound up to logarithmic factors.
  • They further extend the approach to extensive-form games, obtaining the same Õ(t^{-1/4}) rate for last-iterate convergence.
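To make the ingredients above concrete, here is a minimal sketch, in our own notation, of online mirror descent with the log-barrier regularizer R(x) = -Σᵢ log xᵢ under bandit feedback, on a toy matching-pennies game. The step size, the bisection solver for the simplex constraint, and the importance-weighted loss estimates are standard choices we supply for illustration; this is not the paper's exact algorithm or its dual-focused analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def omd_logbarrier_step(x, loss_est, eta):
    """One OMD step with the log-barrier regularizer R(x) = -sum_i log(x_i)
    over the probability simplex. The first-order conditions give
    1/x_new_i = 1/x_i + eta * loss_est_i - lam, with the multiplier lam
    chosen by bisection so that sum(x_new) = 1."""
    c = 1.0 / x + eta * loss_est        # c_i - lam = 1/x_new_i
    lo, hi = c.min() - len(x), c.min()  # sum(x_new) <= 1 at lo, -> inf at hi
    for _ in range(60):
        lam = 0.5 * (lo + hi)
        if np.sum(1.0 / (c - lam)) < 1.0:
            lo = lam
        else:
            hi = lam
    w = 1.0 / (c - lo)
    return w / w.sum()                  # renormalize away bisection residue

# Toy game: matching pennies, value 0, unique minimax strategies (1/2, 1/2).
A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # row player's loss matrix

def run(T=2000, eta=0.05):
    n, m = A.shape
    x, y = np.ones(n) / n, np.ones(m) / m
    for _ in range(T):
        i = rng.choice(n, p=x)          # bandit feedback: each player only
        j = rng.choice(m, p=y)          # sees the sampled action pair's payoff
        lx = np.zeros(n); lx[i] = A[i, j] / x[i]    # importance-weighted
        ly = np.zeros(m); ly[j] = -A[i, j] / y[j]   # loss estimates
        x = omd_logbarrier_step(x, lx, eta)
        y = omd_logbarrier_step(y, ly, eta)
    return x, y

x, y = run()
# Exploitability gap of the last iterate (x, y): best responses on both sides.
gap = (A.T @ x).max() - (A @ y).min()
```

The exploitability gap of the last iterate `(x, y)` is the quantity the paper's Õ(t^{-1/4}) bound controls; it is always nonnegative and is zero exactly at a minimax pair.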

Abstract

We study the problem of learning minimax policies in zero-sum matrix games. Fiegel et al. (2025) recently showed that achieving last-iterate convergence in this setting is harder when the players are uncoupled, by proving a lower bound of Ω(t^{-1/4}) on the exploitability gap. Several online mirror descent algorithms have been proposed for this problem, but none has been shown to attain this rate. We show that log-barrier regularization, combined with a dual-focused analysis, achieves Õ(t^{-1/4}) convergence with high probability. We additionally extend our idea to the setting of extensive-form games, proving a bound with the same rate.
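For readers unfamiliar with the term, the exploitability gap bounded in the abstract is the standard duality gap of a strategy pair; the notation below (payoff matrix $A$, simplices $\Delta_n, \Delta_m$, iterates $x_t, y_t$) is ours, not taken from the paper:

```latex
% Exploitability gap of the iterate (x_t, y_t), where the x-player
% minimizes x^T A y and the y-player maximizes it:
\operatorname{gap}(x_t, y_t)
  \;=\; \max_{y \in \Delta_m} x_t^{\top} A\, y
  \;-\; \min_{x \in \Delta_n} x^{\top} A\, y_t \;\ge\; 0,
```

so a last-iterate rate of Õ(t^{-1/4}) means the current iterates themselves, not their running averages, approach a minimax pair at that rate.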