A Lyapunov Analysis of Softmax Policy Gradient for Stochastic Bandits

arXiv cs.LG · March 30, 2026


Key Points

  • The paper provides a Lyapunov-based theoretical analysis of softmax policy gradient methods for stochastic multi-armed bandits in discrete time, adapting the continuous-time analysis of Lattimore (2026).
  • It establishes a regret bound under a specific learning-rate choice, linking performance to the minimum and maximum action-value gaps ($\Delta_{\min}$, $\Delta_{\max}$) and the horizon ($n$).
  • The proposed learning-rate schedule scales as $\eta = O(\Delta_{\min}^2/(\Delta_{\max}\log(n)))$, which is central to the derived regret guarantee (a runnable sketch follows this list).
  • The resulting regret is shown to be $O(k\log(k)\log(n)/\eta)$, where $k$ denotes the number of arms, giving an explicit dependence on the problem structure and the horizon.
  • Overall, the work strengthens the theoretical understanding of how softmax policy gradient behaves for stochastic bandit problems by supplying a stability-style proof technique (Lyapunov analysis).
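
To make the schedule concrete, here is a minimal simulation sketch of softmax policy gradient on a Bernoulli bandit, with the learning rate set to the order the abstract prescribes. It assumes the gaps are known when choosing $\eta$ (as in the theory, not necessarily a practical agent); the function `softmax_pg_bandit`, its arguments, and the Bernoulli reward model are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(theta):
    z = theta - theta.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def softmax_pg_bandit(means, n, rng=None):
    """Softmax policy gradient on a k-armed Bernoulli bandit.

    `means` are the arm means (unknown to the learner); they are used
    here only to sample rewards and to set the gap-dependent learning
    rate from the paper's abstract.
    """
    rng = np.random.default_rng(rng)
    means = np.asarray(means, dtype=float)
    k = len(means)

    # Gaps between the best arm and each suboptimal arm.
    best = means.max()
    gaps = best - means
    delta_min = gaps[gaps > 0].min()
    delta_max = gaps.max()

    # Learning rate from the abstract: eta = Delta_min^2 / (Delta_max * log n).
    eta = delta_min**2 / (delta_max * np.log(n))

    theta = np.zeros(k)              # softmax parameters, one per arm
    regret = 0.0
    for _ in range(n):
        pi = softmax(theta)
        a = rng.choice(k, p=pi)
        r = float(rng.random() < means[a])   # Bernoulli reward

        # REINFORCE update: grad of log pi(a) w.r.t. theta is e_a - pi.
        grad_log = -pi
        grad_log[a] += 1.0
        theta += eta * r * grad_log

        regret += best - means[a]    # pseudo-regret of the pulled arm
    return regret, eta

if __name__ == "__main__":
    reg, eta = softmax_pg_bandit([0.5, 0.4, 0.3], n=100_000, rng=0)
    print(f"eta = {eta:.4g}, cumulative regret = {reg:.1f}")
```

The update direction $e_a - \pi$ is the score function of the softmax policy, so each rewarded pull shifts probability mass toward the chosen arm; the small, gap-dependent $\eta$ is what the Lyapunov argument leverages to keep this drift stable over the horizon.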

Abstract

We adapt the analysis of policy gradient for continuous-time $k$-armed stochastic bandits by Lattimore (2026) to the standard discrete-time setup. As in continuous time, we prove that with learning rate $\eta = O(\Delta_{\min}^2/(\Delta_{\max}\log(n)))$ the regret is $O(k\log(k)\log(n)/\eta)$, where $n$ is the horizon and $\Delta_{\min}$ and $\Delta_{\max}$ are the minimum and maximum gaps.
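
Substituting the stated learning-rate order into the regret bound makes the gap dependence explicit; the line below is an algebraic consequence of the two displayed rates, not a bound quoted verbatim from the paper.

```latex
% Plugging \eta = \Theta(\Delta_{\min}^2 / (\Delta_{\max}\log(n))) into the regret bound:
R_n \;=\; O\!\left(\frac{k \log(k) \log(n)}{\eta}\right)
    \;=\; O\!\left(\frac{k \log(k)\, \Delta_{\max} \log^{2}(n)}{\Delta_{\min}^{2}}\right).
```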