A Lyapunov Analysis of Softmax Policy Gradient for Stochastic Bandits

arXiv cs.LG · March 30, 2026


Key Points

  • The paper provides a Lyapunov-based theoretical analysis of softmax policy gradient methods for stochastic multi-armed bandits in discrete time, adapting the continuous-time analysis of Lattimore (2026).
  • It establishes a regret bound under a specific learning-rate choice, linking performance to the minimum and maximum action-value gaps ($\Delta_{\min}$, $\Delta_{\max}$) and the horizon ($n$).
  • The proposed learning-rate schedule scales as $\eta = O(\Delta_{\min}^2/(\Delta_{\max}\log(n)))$, which is central to the derived regret guarantee (a runnable sketch follows this list).
  • The resulting regret is shown to be $O(k\log(k)\log(n)/\eta)$, where $k$ denotes the number of arms, giving an explicit dependence on the problem structure and the horizon.
  • Overall, the work strengthens the theoretical understanding of how softmax policy gradient behaves for stochastic bandit problems by supplying a stability-style proof technique (Lyapunov analysis).
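
To make the schedule concrete, here is a minimal simulation sketch of softmax policy gradient on a Bernoulli bandit, with the learning rate set to the order the abstract prescribes. It assumes the gaps are known when choosing $\eta$ (as in the theory, not necessarily a practical agent); the function `softmax_pg_bandit`, its arguments, and the Bernoulli reward model are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(theta):
    z = theta - theta.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def softmax_pg_bandit(means, n, rng=None):
    """Softmax policy gradient on a k-armed Bernoulli bandit.

    `means` are the arm means (unknown to the learner); they are used
    here only to sample rewards and to set the gap-dependent learning
    rate from the paper's abstract.
    """
    rng = np.random.default_rng(rng)
    means = np.asarray(means, dtype=float)
    k = len(means)

    # Gaps between the best arm and each suboptimal arm.
    best = means.max()
    gaps = best - means
    delta_min = gaps[gaps > 0].min()
    delta_max = gaps.max()

    # Learning rate from the abstract: eta = Delta_min^2 / (Delta_max * log n).
    eta = delta_min**2 / (delta_max * np.log(n))

    theta = np.zeros(k)              # softmax parameters, one per arm
    regret = 0.0
    for _ in range(n):
        pi = softmax(theta)
        a = rng.choice(k, p=pi)
        r = float(rng.random() < means[a])   # Bernoulli reward

        # REINFORCE update: grad of log pi(a) w.r.t. theta is e_a - pi.
        grad_log = -pi
        grad_log[a] += 1.0
        theta += eta * r * grad_log

        regret += best - means[a]    # pseudo-regret of the pulled arm
    return regret, eta

if __name__ == "__main__":
    reg, eta = softmax_pg_bandit([0.5, 0.4, 0.3], n=100_000, rng=0)
    print(f"eta = {eta:.4g}, cumulative regret = {reg:.1f}")
```

The update direction $e_a - \pi$ is the score function of the softmax policy, so each rewarded pull shifts probability mass toward the chosen arm; the small, gap-dependent $\eta$ is what the Lyapunov argument leverages to keep this drift stable over the horizon.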

Abstract

We adapt the analysis of policy gradient for continuous-time $k$-armed stochastic bandits by Lattimore (2026) to the standard discrete-time setup. As in continuous time, we prove that with learning rate $\eta = O(\Delta_{\min}^2/(\Delta_{\max}\log(n)))$ the regret is $O(k\log(k)\log(n)/\eta)$, where $n$ is the horizon and $\Delta_{\min}$ and $\Delta_{\max}$ are the minimum and maximum gaps.
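
Substituting the stated learning-rate order into the regret bound makes the gap dependence explicit; the line below is an algebraic consequence of the two displayed rates, not a bound quoted verbatim from the paper.

```latex
% Plugging \eta = \Theta(\Delta_{\min}^2 / (\Delta_{\max}\log(n))) into the regret bound:
R_n \;=\; O\!\left(\frac{k \log(k) \log(n)}{\eta}\right)
    \;=\; O\!\left(\frac{k \log(k)\, \Delta_{\max} \log^{2}(n)}{\Delta_{\min}^{2}}\right).
```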