Abstract
We adapt the analysis of policy gradient for continuous-time $k$-armed stochastic bandits by Lattimore (2026) to the standard discrete-time setup. As in continuous time, we prove that with learning rate $\eta = O(\Delta_{\min}^2/(\Delta_{\max} \log(n)))$ the regret is $O(k \log(k) \log(n) / \eta)$, where $n$ is the horizon and $\Delta_{\min}$ and $\Delta_{\max}$ are the minimum and maximum gaps.
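To make the discrete-time setup concrete, below is a minimal sketch of a policy-gradient (softmax/REINFORCE) bandit algorithm with the learning rate scaled as in the abstract, $\eta = \Delta_{\min}^2/(\Delta_{\max}\log n)$ up to constants. The softmax parameterization, Gaussian rewards, and all function and variable names here are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def policy_gradient_bandit(means, n, eta, seed=None):
    """Softmax policy gradient on a k-armed Gaussian bandit.

    means : true mean reward of each arm
    n     : horizon (number of rounds)
    eta   : learning rate
    Returns the cumulative regret over the horizon.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    theta = np.zeros(k)  # softmax logits, one per arm
    best = max(means)
    regret = 0.0
    for _ in range(n):
        # Softmax policy over the k arms (shift logits for stability).
        pi = np.exp(theta - theta.max())
        pi /= pi.sum()
        a = rng.choice(k, p=pi)                # sample an arm
        x = means[a] + rng.standard_normal()   # noisy reward
        # Unbiased REINFORCE gradient of E[X]: x * (e_a - pi).
        grad = -x * pi
        grad[a] += x
        theta += eta * grad
        regret += best - means[a]
    return regret

# Example: set eta from the gaps as in the abstract (up to constants).
means = np.array([0.9, 0.5, 0.3])
gaps = means.max() - means
dmin, dmax = gaps[gaps > 0].min(), gaps.max()
n = 10_000
eta = dmin**2 / (dmax * np.log(n))
print(policy_gradient_bandit(means, n, eta, seed=0))
```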


