Abstract
We study a continuous-time diffusion approximation of policy gradient for $k$-armed stochastic bandits. We prove that with a learning rate $\eta = O(\Delta^2/\log(n))$ the regret is $O(k \log(k) \log(n) / \eta)$, where $n$ is the horizon and $\Delta$ is the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless $\eta = O(\Delta^2)$.
