Abstract
We study a continuous-time diffusion approximation of policy gradient for $k$-armed stochastic bandits. We prove that with a learning rate $\eta = O(\Delta^2/\log(n))$ the regret is $O(k \log(k) \log(n) / \eta)$, where $n$ is the horizon and $\Delta$ is the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless $\eta = O(\Delta^2)$.
