Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback

arXiv cs.LG / 3/27/2026

Key Points

This paper studies Online Convex Optimization (OCO) under adversarial losses where the learner receives two-point bandit feedback (only function values at two queried points).
It addresses the previously open problem of obtaining tight high-probability regret bounds for OCO with strongly convex losses, which is difficult because bandit gradient estimators are heavy-tailed.
The authors prove the first high-probability regret bound of order O(d(\log T + \log(1/\delta))/\mu) for \mu-strongly convex losses.
The bound is shown to be minimax optimal with respect to both the time horizon T and the dimension d, improving the theoretical guarantee beyond what prior analyses could achieve.
Overall, the work advances the theoretical understanding of learning with bandit feedback by overcoming concentration-analysis challenges posed by heavy-tailed estimators.
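
For readers who want the claim in symbols, the following is a sketch of the standard definitions the key points rely on. The notation (the feasible set \mathcal{K}, the played points x_t, the losses f_t) is assumed here and does not come from the summary; the paper's exact regret notion, for example whether losses are charged at the two queried points, may differ.

```latex
% mu-strong convexity of each loss f_t on the feasible set K (standard definition):
f_t(y) \ge f_t(x) + \langle \nabla f_t(x),\, y - x \rangle + \frac{\mu}{2}\,\|y - x\|^2
\qquad \text{for all } x, y \in \mathcal{K}.

% Regret after T rounds against the best fixed point in hindsight:
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x).

% Shape of the stated guarantee: with probability at least 1 - \delta,
\mathrm{Regret}_T \;=\; O\!\left( \frac{d\,\bigl(\log T + \log(1/\delta)\bigr)}{\mu} \right).
```

Here \delta is the failure probability, so the bound degrades only logarithmically as the required confidence increases.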

Abstract

We consider the problem of Online Convex Optimization (OCO) with two-point bandit feedback in an adversarial environment. In this setting, a player attempts to minimize a sequence of adversarially generated convex loss functions, while only observing the value of each function at two points. While it is well known that two-point feedback allows for gradient estimation, achieving tight high-probability regret bounds for strongly convex functions has remained open, as highlighted by Agarwal et al. (2010). The primary challenge lies in the heavy-tailed nature of bandit gradient estimators, which makes standard concentration analysis difficult. In this paper, we resolve this open challenge by providing the first high-probability regret bound of O(d(\log T + \log(1/\delta))/\mu) for \mu-strongly convex losses. Our result is minimax optimal with respect to both the time horizon T and the dimension d.
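
To make "two-point feedback allows for gradient estimation" concrete, here is a minimal sketch of the classical two-point estimator combined with projected online gradient descent using step size 1/(\mu t). It is not the paper's algorithm, which the abstract does not describe. All names (`two_point_gradient`, `run_two_point_ogd`, the probing radius `eps`, the Euclidean-ball feasible set) are illustrative assumptions. Note that `eps` is the probing radius and is unrelated to the failure probability \delta in the bound.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def two_point_gradient(f, x, eps, rng):
    """Classical two-point estimate: (d / (2 eps)) * (f(x + eps u) - f(x - eps u)) * u."""
    d = len(x)
    u = sample_unit_sphere(d, rng)
    x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
    x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
    # Only two function values are used; no gradient of f is ever observed.
    scale = d * (f(x_plus) - f(x_minus)) / (2.0 * eps)
    return [scale * ui for ui in u]


def project_ball(x, radius):
    """Euclidean projection onto the ball of the given radius centered at 0."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def run_two_point_ogd(losses, d, mu, radius=1.0, eps=1e-3, seed=0):
    """Projected OGD with step size 1 / (mu * t), driven by two-point estimates.

    Assumption: the feasible set is a Euclidean ball. Iterates are kept inside
    the shrunk ball of radius (radius - eps) so both queries stay feasible.
    """
    rng = random.Random(seed)
    x = [0.0] * d
    played = []
    for t, f in enumerate(losses, start=1):
        played.append(list(x))
        g = two_point_gradient(f, x, eps, rng)
        step = 1.0 / (mu * t)
        x = project_ball([xi - step * gi for xi, gi in zip(x, g)], radius - eps)
    return played
```

As a usage example, passing the same quadratic loss 0.5 * mu * ||x - c||^2 in every round (a non-adversarial special case) should drive the iterates toward c. The estimate's magnitude scales with d and depends on the random direction u, which gives some intuition for why concentration arguments for such estimators are delicate.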
