Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback

arXiv cs.LG / 3/27/2026


Key Points

  • This paper studies Online Convex Optimization (OCO) under adversarial losses where the learner receives two-point bandit feedback (only function values at two queried points).
  • It addresses a previously open problem: obtaining tight high-probability regret bounds for OCO with strongly convex losses, which had been difficult because bandit gradient estimators are heavy-tailed (a standard two-point estimator is sketched after this list).
  • The authors prove the first high-probability regret bound of order O(d(log T + log(1/δ))/μ) for μ-strongly convex losses.
  • The bound is shown to be minimax optimal with respect to both the time horizon T and the dimension d, improving the theoretical guarantee beyond what prior analyses could achieve.
  • Overall, the work advances the theoretical understanding of learning with bandit feedback by overcoming concentration-analysis challenges posed by heavy-tailed estimators.
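
As context for the first two bullets, below is a minimal sketch of the standard two-point gradient estimator and the classical online gradient descent baseline for strongly convex losses. The function names, the step size 1/(μ·t), and the toy quadratic losses are illustrative assumptions drawn from the general two-point bandit literature, not the specific algorithm or analysis of this paper.

```python
import numpy as np

def two_point_gradient_estimate(f, x, delta, rng):
    """Standard two-point gradient estimator from the bandit OCO literature.

    Queries f at x + delta*u and x - delta*u for a uniformly random unit
    vector u and returns (d / (2*delta)) * (f(x + delta*u) - f(x - delta*u)) * u,
    an unbiased estimate of the gradient of a delta-smoothed version of f.
    """
    d = x.shape[0]
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)              # uniform direction on the unit sphere
    return (d / (2.0 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u


def bandit_ogd_strongly_convex(fs, x0, mu, delta, seed=0):
    """Online gradient descent driven by two-point bandit feedback.

    Uses the classical step size 1/(mu*t) for mu-strongly convex losses;
    this is the standard baseline, not necessarily the algorithm whose
    high-probability regret is analyzed in the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    iterates = [x.copy()]
    for t, f in enumerate(fs, start=1):
        g = two_point_gradient_estimate(f, x, delta, rng)
        eta = 1.0 / (mu * t)            # step size tuned for strong convexity
        x = x - eta * g                 # (project onto the feasible set here if constrained)
        iterates.append(x.copy())
    return iterates


# Toy usage: a fixed strongly convex quadratic stands in for the loss sequence.
if __name__ == "__main__":
    d, mu = 5, 1.0
    target = np.ones(d)
    losses = [lambda x: 0.5 * mu * np.dot(x - target, x - target) for _ in range(200)]
    xs = bandit_ogd_strongly_convex(losses, x0=np.zeros(d), mu=mu, delta=1e-3)
    print("final iterate:", np.round(xs[-1], 2))
```

The heavy-tailed behavior mentioned above comes from the 1/delta scaling in the estimator: its variance can blow up as delta shrinks, which is exactly what makes standard concentration arguments hard in the high-probability analysis.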

Abstract

We consider the problem of Online Convex Optimization (OCO) with two-point bandit feedback in an adversarial environment. In this setting, a player attempts to minimize a sequence of adversarially generated convex loss functions, while only observing the value of each function at two points. While it is well known that two-point feedback allows for gradient estimation, achieving tight high-probability regret bounds for strongly convex functions has remained open, as highlighted by Agarwal et al. (2010). The primary challenge lies in the heavy-tailed nature of bandit gradient estimators, which makes standard concentration analysis difficult. In this paper, we resolve this open challenge by providing the first high-probability regret bound of O(d(log T + log(1/δ))/μ) for μ-strongly convex losses. Our result is minimax optimal with respect to both the time horizon T and the dimension d.
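
For reference, here is one common way to write down the quantity being bounded. In the two-point model, regret is usually measured against the average loss of the two queried points; this formulation follows the standard two-point setup of Agarwal et al. (2010) and is an assumption here rather than a definition taken from the abstract.

```latex
% Regret against the average of the two queried points y_t^{(1)}, y_t^{(2)},
% compared with the best fixed point x in the feasible set K:
\[
  \mathrm{Regret}_T
    = \sum_{t=1}^{T} \frac{f_t\!\left(y_t^{(1)}\right) + f_t\!\left(y_t^{(2)}\right)}{2}
      - \min_{x \in K} \sum_{t=1}^{T} f_t(x).
\]
% High-probability guarantee stated in the abstract for \mu-strongly convex losses:
\[
  \Pr\!\left[\ \mathrm{Regret}_T
      \le O\!\left(\frac{d\left(\log T + \log(1/\delta)\right)}{\mu}\right) \right]
    \ge 1 - \delta .
\]
```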