Vanishing L2 regularization for the softmax Multi Armed Bandit

arXiv stat.ML / 5/6/2026


Key Points

  • The paper studies multi-armed bandit (MAB) algorithms that select actions through a softmax mapping, focusing on the softmax policy gradient obtained when a quadratic (L2) regularization term is subtracted from the mean reward (sketched below, after this list).
  • It addresses a gap in prior work: existing convexity-based analyses did not provide an appropriate framework to prove convergence when the regularization strength goes to zero.
  • The authors prove new convergence results for this “vanishing L2 regularization” regime, i.e., guarantees on how the method behaves as the regularization parameter tends to zero.
  • They also run experiments on standard benchmarks to confirm that, in practice, the near-zero L2 regularization setting can be numerically advantageous.
  • Overall, the work links theory and practice by showing both provable convergence and improved numerical behavior for this specific softmax MAB/policy-gradient setup.
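
One plausible reading of the setup described above (the exact objective and notation are not given in this summary, so the symbols here are assumptions) is a softmax policy over K arms with preference vector θ, where the regularized objective subtracts a quadratic penalty from the expected reward:

```latex
% Softmax policy over K arms with preferences \theta (notation assumed for illustration)
\pi_\theta(a) = \frac{e^{\theta_a}}{\sum_{b=1}^{K} e^{\theta_b}},
\qquad
J_\lambda(\theta) = \sum_{a=1}^{K} \pi_\theta(a)\,\mu_a \;-\; \frac{\lambda}{2}\,\lVert\theta\rVert_2^2 ,
\qquad
\nabla_{\theta_a} J_\lambda(\theta) = \pi_\theta(a)\Big(\mu_a - \sum_{b=1}^{K} \pi_\theta(b)\,\mu_b\Big) - \lambda\,\theta_a .
```

Under this reading, the “vanishing regularization” regime studied in the paper corresponds to letting λ tend to zero.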

Abstract

Multi Armed Bandit (MAB) algorithms are a cornerstone of reinforcement learning and have been studied both theoretically and numerically. One of the most commonly used implementations uses a softmax mapping to prescribe the optimal policy and has served as the foundation for downstream algorithms, including REINFORCE. Distinct from vanilla approaches, we consider here the L2 regularized softmax policy gradient, where a quadratic term is subtracted from the mean reward. Previous studies exploiting convexity failed to identify a suitable theoretical framework to analyze its convergence when the regularization parameter vanishes. We prove theoretical convergence results and confirm empirically that this regime makes the L2 regularization numerically advantageous on standard benchmarks.
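
As an illustration only (the paper's exact algorithm, update rule, and schedule are not given here), a minimal softmax policy-gradient bandit with an L2 penalty whose coefficient is annealed toward zero might look as follows; the function names, baseline, and 1/t decay are assumptions made for the sketch.

```python
import numpy as np

def softmax(theta):
    z = theta - theta.max()          # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def run_bandit(true_means, steps=5000, lr=0.1, lam0=0.1, seed=0):
    """Softmax policy gradient on a K-armed bandit with an L2 penalty whose
    strength decays toward zero (a 'vanishing regularization' schedule chosen
    here purely for illustration)."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    theta = np.zeros(K)              # arm preferences
    baseline = 0.0                   # running average reward used as a baseline
    for t in range(1, steps + 1):
        pi = softmax(theta)
        a = rng.choice(K, p=pi)
        r = rng.normal(true_means[a], 1.0)      # noisy reward from the pulled arm
        baseline += (r - baseline) / t
        # REINFORCE-style gradient of expected reward: (r - baseline) * (e_a - pi) ...
        grad = -pi * (r - baseline)
        grad[a] += (r - baseline)
        # ... minus the gradient of the quadratic penalty (lam/2) * ||theta||^2
        lam = lam0 / t                           # assumed decay; the paper's schedule may differ
        theta += lr * (grad - lam * theta)
    return softmax(theta)

if __name__ == "__main__":
    # Probability mass should concentrate on the best arm (index 2 here).
    print(run_bandit(np.array([0.1, 0.5, 0.9])))
```

In this sketch the penalty keeps the preferences bounded early on, while letting λ decay allows the softmax policy to concentrate on the best arm as learning progresses; the paper's contribution is the theory and benchmarks for that regime, not this particular implementation.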