AI Navigate

A Diffusion Analysis of Policy Gradient for Stochastic Bandits

arXiv cs.AI / 3/12/2026


Key Points

  • The authors study a continuous-time diffusion approximation of policy gradient for k-armed stochastic bandits.
  • They prove that with a learning rate η = O(Δ²/log(n)), the regret is O(k log(k) log(n)/η), where n is the horizon and Δ is the minimum gap.
  • They construct an instance with only logarithmically many arms on which the regret is linear unless η = O(Δ²).
  • The results give guidance on choosing the learning rate to balance exploration and regret in diffusion-based policy gradient methods for bandits; a minimal simulation sketch follows this list.
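
Below is a minimal simulation sketch of the discrete-time counterpart of this setting: a tabular softmax (Gibbs) policy-gradient learner on a Bernoulli k-armed bandit, with the learning rate set at the stated threshold η = Δ²/log(n). The softmax parameterization, the REINFORCE-style update, and the specific arm means are illustrative assumptions; the paper analyzes the continuous-time diffusion limit rather than this exact loop.

```python
import numpy as np

# Minimal sketch (not the paper's exact setup): a tabular softmax policy-gradient
# learner on a Bernoulli k-armed bandit, with the learning rate chosen at the
# stated threshold eta = Delta^2 / log(n). The softmax parameterization and the
# REINFORCE-style update are illustrative assumptions.

def softmax_policy_gradient_regret(means, n, eta, seed=0):
    """Run n rounds and return the cumulative pseudo-regret."""
    rng = np.random.default_rng(seed)
    k = len(means)
    theta = np.zeros(k)              # per-arm logits
    best = np.max(means)
    regret = 0.0
    for _ in range(n):
        probs = np.exp(theta - theta.max())
        probs /= probs.sum()
        arm = rng.choice(k, p=probs)
        reward = rng.binomial(1, means[arm])   # Bernoulli reward
        # REINFORCE update for a softmax policy: grad_j = r * (1{j == arm} - probs[j])
        grad = -reward * probs
        grad[arm] += reward
        theta += eta * grad
        regret += best - means[arm]
    return regret

means = np.array([0.55, 0.50, 0.50, 0.50])     # minimum gap Delta = 0.05 (assumed instance)
delta = np.max(means) - np.max(means[means < np.max(means)])
n = 100_000
eta = delta**2 / np.log(n)                     # learning rate at the stated threshold
print(f"eta = {eta:.2e}, cumulative regret = {softmax_policy_gradient_regret(means, n, eta):.1f}")
```

Per the stated results, pushing η above the Δ² scale risks the linear-regret behavior of the lower-bound instance, while shrinking it inflates the O(k log(k) log(n)/η) upper bound; the threshold η = Δ²/log(n) is where the two pressures are balanced.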

Abstract

We study a continuous-time diffusion approximation of policy gradient for k-armed stochastic bandits. We prove that with a learning rate η = O(Δ²/log(n)) the regret is O(k log(k) log(n)/η), where n is the horizon and Δ is the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless η = O(Δ²).
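
As a quick check on the scale of the bound (writing R_n for the regret, a symbol not used in the abstract, and treating the hidden constants as 1), substituting the learning-rate condition into the regret bound gives:

```latex
\[
  \eta = \frac{\Delta^2}{\log n}
  \quad\Longrightarrow\quad
  R_n = O\!\left(\frac{k \log k \,\log n}{\eta}\right)
      = O\!\left(\frac{k \log k \,\log^2 n}{\Delta^2}\right),
\]
```

so for a fixed gap Δ the guarantee grows only polylogarithmically in the horizon n.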