Efficient learning by implicit exploration in bandit problems with side observations

arXiv stat.ML / 4/28/2026


Key Points

  • The paper studies online learning under a partial observability model that lies between full-information and bandit feedback: in addition to its own loss, the learner observes the losses of some other actions, determined by its chosen action and a directed observation system chosen by the environment.
  • It introduces the first algorithm that achieves near-optimal regret guarantees without needing to know the observation system before selecting its actions.
  • The authors also define a new partial-information framework for online combinatorial optimization in which the feedback lies between semi-bandit and full feedback.
  • Because the first algorithm's predictions cannot always be computed efficiently in that setting, they propose an alternative algorithm with similar regret guarantees that is always computationally efficient, at the price of a slightly more complicated tuning mechanism.
  • Both algorithms use a new exploration method called “implicit exploration,” which the paper argues is more efficient, both computationally and information-theoretically, than earlier exploration strategies for the problem; a minimal sketch of the idea follows this list.
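
To make the last point concrete, here is a minimal Python sketch of implicit exploration in an Exp3-style learner. It assumes losses in [0, 1] and an observation graph given as `neighbors[i]`, the set of arms whose losses are revealed when arm `i` is played; note the graph is only consulted after acting, so it need not be known in advance. The names `exp3_ix`, `eta`, and `gamma` are our own illustration, not the paper's pseudocode.

```python
import numpy as np

def exp3_ix(losses, neighbors, eta=0.1, gamma=0.05, seed=0):
    """Exp3-style learner with an IX (implicit exploration) loss estimator.

    losses:    (T, K) array of per-round losses in [0, 1].
    neighbors: dict mapping each arm i to the set of arms whose losses
               are observed when i is played (assumed to contain i).
    """
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    weights = np.ones(K)
    total_loss = 0.0
    for t in range(T):
        p = weights / weights.sum()      # sampling distribution
        action = rng.choice(K, p=p)
        total_loss += losses[t, action]
        for i in neighbors[action]:      # side observations revealed now
            # Probability that arm i's loss would have been observed.
            o_i = sum(p[j] for j in range(K) if i in neighbors[j])
            # IX estimator: dividing by o_i + gamma (not o_i) biases the
            # estimate slightly downward but keeps it bounded, removing
            # the need for an explicit uniform-exploration component.
            weights[i] *= np.exp(-eta * losses[t, i] / (o_i + gamma))
        weights /= weights.max()         # guard against numerical underflow
    return total_loss
```

The one-line difference from a plain importance-weighted estimator is the `+ gamma` in the denominator; that small, controlled bias is what makes the exploration “implicit.”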

Abstract

We consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also gets to observe losses of some other actions. The revealed losses depend on the learner's action and a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys near-optimal regret guarantees without having to know the observation system before selecting its actions. Along similar lines, we also define a new partial information setting that models online combinatorial optimization problems where the feedback received by the learner is between semi-bandit and full feedback. As the predictions of our first algorithm cannot always be computed efficiently in this setting, we propose another algorithm with similar properties and with the benefit of always being computationally efficient, at the price of a slightly more complicated tuning mechanism. Both algorithms rely on a novel exploration strategy called implicit exploration, which is shown to be more efficient both computationally and information-theoretically than previously studied exploration strategies for the problem.
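
In the combinatorial setting, the observation probability o can itself be intractable to compute exactly, which is one reason the paper's efficient, follow-the-perturbed-leader-style algorithm instead estimates 1/o by resampling, in the spirit of geometric resampling. A hedged sketch of that idea, under our own naming: `capped_inverse_estimate` and `simulate_observation` are hypothetical, the latter a callback that redraws the learner's randomness and reports whether the loss in question would have been observed.

```python
def capped_inverse_estimate(simulate_observation, cap):
    """Geometric-resampling estimate of 1/o, capped at `cap`.

    simulate_observation: zero-argument callable that returns True with
        the (unknown) probability o that the loss of interest would be
        observed under a fresh draw of the learner's randomness.
    Returns min(Geometric(o), cap), whose expectation is
    (1 - (1 - o)**cap) / o: a downward-biased but bounded estimate of
    1/o, the same flavor of bias the +gamma term gives the IX estimator.
    """
    for k in range(1, cap + 1):
        if simulate_observation():
            return k
    return cap
```

For example, with o = 0.2 and cap = 50, averaging many draws converges to (1 − 0.8^50)/0.2 ≈ 5, so the cap plays the same boundedness role that gamma plays in the sketch above.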