Replicable Bandits with UCB based Exploration

arXiv cs.LG / 4/23/2026

📰 News · Models & Research

Key Points

  • The paper studies “replicable” UCB-based algorithms for stochastic multi-armed bandits and linear bandits, requiring that two runs with shared randomness but different rewards follow the same action sequence with high probability.
  • It proposes RepUCB for stochastic multi-armed bandits, providing an explicit regret bound whose dependence on time, arms, and the replicability parameter ρ is quantified.
  • For stochastic linear bandits, it introduces RepRidge, a replicable ridge regression estimator with both confidence and ρ-replicability guarantees, which is then used to build RepLinUCB.
  • The authors show RepLinUCB achieves a regret bound of Õ((d + d^3/ρ)√T), improving the best previous guarantee by a factor of about O(d/ρ), i.e., it reduces the “price of replicability.”
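
To make the replicability criterion concrete, here is a minimal runnable sketch: a plain batched-free UCB learner (a stand-in, not the paper's RepUCB) is run twice with the same internal seed (used only for tie-breaking) but independent reward realizations, and the two action sequences are compared. The ρ-replicability requirement is that such sequences agree with probability at least 1 − ρ. All function and variable names here are illustrative.

```python
import numpy as np

def ucb_run(means, T, seed_internal, seed_rewards):
    """Run standard UCB on Bernoulli arms and return the action sequence.

    `seed_internal` drives the algorithm's internal randomness (tie-breaking),
    which is *shared* between the two executions in the replicability
    definition; `seed_rewards` drives the environment's reward draws, which
    are *independent* across executions.
    """
    rng_alg = np.random.default_rng(seed_internal)  # shared internal randomness
    rng_env = np.random.default_rng(seed_rewards)   # independent reward noise
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    actions = []
    for t in range(T):
        if t < K:
            a = t  # initialization: pull each arm once
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
            # ties broken with internal randomness only
            a = int(rng_alg.choice(np.flatnonzero(ucb == ucb.max())))
        r = float(rng_env.random() < means[a])  # Bernoulli reward
        counts[a] += 1
        sums[a] += r
        actions.append(a)
    return actions

means = [0.5, 0.52]
# Same internal seed, different reward seeds: replicability asks that the
# two action sequences coincide with probability at least 1 - rho.
run1 = ucb_run(means, 200, seed_internal=0, seed_rewards=1)
run2 = ucb_run(means, 200, seed_internal=0, seed_rewards=2)
agree = (run1 == run2)  # vanilla UCB gives no such guarantee in general
```

Vanilla UCB is reward-sensitive at every round, so small differences in reward realizations can change the chosen arm; this is exactly the failure mode that replicable algorithms like RepUCB are designed to control.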

Abstract

We study replicable algorithms for stochastic multi-armed bandits (MAB) and linear bandits with UCB (Upper Confidence Bound) based exploration. A bandit algorithm is \rho-replicable if two executions using shared internal randomness but independent reward realizations produce the same action sequence with probability at least 1-\rho. Prior work is primarily elimination-based and, in linear bandits with infinitely many actions, relies on discretization, leading to suboptimal dependence on the dimension d and \rho. We develop optimistic alternatives for both settings. For stochastic multi-armed bandits, we propose RepUCB, a replicable batched UCB algorithm, and show that it attains regret O\!\left(\frac{K^2\log^2 T}{\rho^2}\sum_{a:\Delta_a>0}\left(\Delta_a+\frac{\log(KT\log T)}{\Delta_a}\right)\right). For stochastic linear bandits, we first introduce RepRidge, a replicable ridge regression estimator that satisfies both a confidence guarantee and a \rho-replicability guarantee. Beyond its role in our bandit algorithm, this estimator and its guarantees may also be of independent interest in other statistical estimation settings. We then use RepRidge to design RepLinUCB, a replicable optimistic algorithm for stochastic linear bandits, and show that its regret is bounded by \widetilde{O}\!\big(\big(d+\frac{d^3}{\rho}\big)\sqrt{T}\big). This improves the best prior regret guarantee by a factor of O(d/\rho), showing that our optimistic algorithm can substantially reduce the price of replicability.
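
The abstract does not specify RepRidge's construction, but a common device behind replicable estimators in this literature is randomized rounding to a grid whose offset is drawn from the shared internal randomness: two executions use the same offset, so nearby estimates computed from independent samples collapse to the same output. The sketch below illustrates that device for a scalar mean; it is only an assumption-laden illustration, not the paper's RepRidge (which handles high-dimensional ridge regression with confidence guarantees).

```python
import numpy as np

def replicable_round(estimate, precision, rng):
    """Round `estimate` to a grid of width `precision` with a random offset.

    The offset is drawn from `rng`, i.e., from the algorithm's *internal*
    randomness. Two executions sharing the internal seed draw the same
    offset, so any two estimates falling in the same grid cell are mapped
    to an identical output; they disagree only when a (randomly placed)
    cell boundary separates them, which happens with probability roughly
    |estimate1 - estimate2| / precision.
    """
    offset = rng.uniform(0.0, precision)
    return np.floor((estimate - offset) / precision) * precision + offset

# Two executions: shared internal seed (42), independent samples of the
# same underlying mean 0.3.
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)
data1 = np.random.default_rng(1).normal(0.3, 1.0, 10_000)
data2 = np.random.default_rng(2).normal(0.3, 1.0, 10_000)
out1 = replicable_round(data1.mean(), precision=0.1, rng=rng1)
out2 = replicable_round(data2.mean(), precision=0.1, rng=rng2)
# out1 and out2 typically coincide even though the samples differ.
```

The trade-off is accuracy for replicability: the rounded output deviates from the raw estimate by at most `precision`, mirroring the regret cost (the "price of replicability") that the paper quantifies in its bounds.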