Trading off rewards and errors in multi-armed bandits

arXiv cs.LG / 5/4/2026


Key Points

  • The paper studies multi-armed bandits, focusing on the tension between accurately identifying each arm’s mean and maximizing cumulative reward.
  • It argues that arms explored most often become the most informative, while pure reward maximization tends to concentrate on only the best-performing arm.
  • The authors propose an algorithm that smoothly interpolates between these two goals and provides regret guarantees.
  • They establish theoretical performance limits by proving both upper and lower bounds and support the claims with empirical experiments.

Abstract

In multi-armed bandits, the most-explored arms are the most informative, while reward maximization typically pulls only the best arm. We study the tradeoff between identifying arm means accurately and accumulating reward, and present an algorithm with regret guarantees that interpolates between the two objectives. We prove both upper and lower bounds and validate the approach empirically.
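To make the tradeoff concrete, here is a toy sketch (not the paper's algorithm) of a bandit loop that scores each arm by a weighted mix of its empirical mean (the reward objective) and an exploration term that shrinks with pull count (the estimation objective). The weight `w`, the Gaussian reward model, and the `1/sqrt(pulls)` exploration term are illustrative assumptions; the paper's actual algorithm and guarantees differ.

```python
import random

def interpolated_bandit(arm_means, horizon, w, seed=0):
    """Toy illustration of the reward/estimation tradeoff.

    Each round, every arm i is scored as
        w * empirical_mean(i) + (1 - w) / sqrt(pulls(i))
    and the highest-scoring arm is pulled. w = 1 is greedy reward
    maximization (pulls concentrate on the apparently best arm);
    w = 0 always pulls the least-sampled arm, spreading pulls
    uniformly so every mean is estimated equally well.
    NOTE: a hedged sketch, not the algorithm from the paper.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    pulls = [0] * k
    sums = [0.0] * k
    total = 0.0
    for t in range(horizon):
        if t < k:
            a = t  # initialization: pull each arm once
        else:
            scores = [w * (sums[i] / pulls[i]) + (1 - w) / pulls[i] ** 0.5
                      for i in range(k)]
            a = max(range(k), key=scores.__getitem__)
        r = rng.gauss(arm_means[a], 1.0)  # assumed Gaussian rewards
        pulls[a] += 1
        sums[a] += r
        total += r
    return total, pulls
```

Sweeping `w` from 0 to 1 traces out the tradeoff the key points describe: at `w = 0` the pull counts are (nearly) uniform, minimizing the worst estimation error, while at `w = 1` pulls concentrate and cumulative reward grows at the cost of poorly estimated suboptimal arms.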
