Asymptotically and Minimax Optimal Regret Bounds for Multi-Armed Bandits with Abstention

arXiv stat.ML / 2026/3/24


Key Points

  • The paper extends the classic multi-armed bandit framework with an “abstention” action: the agent may decline the stochastic reward before observing it, in exchange for either a fixed regret or a guaranteed reward.
  • It addresses whether computationally efficient algorithms can achieve both asymptotically optimal and minimax-optimal regret under this abstention setting.
  • The authors propose and analyze new algorithms whose regret bounds match information-theoretic lower bounds, establishing optimality in a rigorous sense.
  • The study gives quantitative insight into how abstention improves performance over standard bandits, with extensive numerical experiments confirming the practical benefits predicted by the theory.
  • The work is framed as groundwork for applying abstention-style options to other online decision-making problems beyond multi-armed bandits.

Abstract

We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic innovation: abstention. In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the stochastic instantaneous reward before observing it. When opting for abstention, the agent either suffers a fixed regret or gains a guaranteed reward. This added layer of complexity naturally prompts the key question: can we develop algorithms that are both computationally efficient and asymptotically and minimax optimal in this setting? We answer this question in the affirmative by designing and analyzing algorithms whose regrets meet their corresponding information-theoretic lower bounds. Our results offer valuable quantitative insights into the benefits of the abstention option, laying the groundwork for further exploration in other online decision-making problems with such an option. Extensive numerical experiments validate our theoretical results, demonstrating that our approach not only advances theory but also has the potential to deliver significant practical benefits.
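To make the setting concrete, here is a toy simulation of a Bernoulli bandit with an abstention option: a UCB-style learner that falls back on the guaranteed reward whenever its best index looks worse than abstaining. This is an illustrative sketch only, not the authors' algorithm; the specific decision rule and the `guaranteed` parameter are assumptions made for this example.

```python
import math
import random

def simulate(means, horizon, guaranteed=0.5, seed=0):
    """Toy K-armed Bernoulli bandit with an abstention option.

    Each step, pick the arm with the highest UCB1 index; if that index
    falls below the guaranteed reward, abstain and bank `guaranteed`
    instead of drawing the stochastic reward.  Illustrative only --
    this is not the algorithm analyzed in the paper.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # pulls per arm
    sums = [0.0] * k      # accumulated rewards per arm
    total = 0.0
    abstentions = 0

    for t in range(1, horizon + 1):
        def ucb(i):
            # Unpulled arms get an infinite index to force exploration.
            if counts[i] == 0:
                return float("inf")
            return sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])

        arm = max(range(k), key=ucb)
        if ucb(arm) < guaranteed:
            total += guaranteed   # abstain: take the safe payoff, learn nothing
            abstentions += 1
        else:
            r = 1.0 if rng.random() < means[arm] else 0.0
            sums[arm] += r
            counts[arm] += 1
            total += r
    return total, abstentions
```

When `guaranteed` exceeds every arm's mean, the learner eventually abstains on most rounds, which mirrors the paper's intuition that the abstention option caps the regret an agent must suffer while exploring.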