Online learning with Erdős-Rényi side-observation graphs

arXiv stat.ML / 4/29/2026

📰 News · Models & Research

Key Points

  • The paper studies adversarial multi-armed bandit learning where the learner can sometimes observe the losses of non-chosen arms via side-observation graphs.
  • It assumes each non-selected arm reveals its loss independently with an unknown fixed probability r, and proposes two algorithms tailored to different ranges of r.
  • For the case r ≥ (log T)/(2N), the first algorithm attains expected regret O(√((T/r) log N)) after T rounds with N arms.
  • For smaller r, the second algorithm achieves a regret bound of O(√((T/r) log (N+T))); the authors also give a quick estimation procedure that decides which r-regime applies.
  • The regret bounds are shown to match (up to logarithmic factors) the best performance achievable even by algorithms that are allowed to know r in advance.
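The feedback model above — one chosen arm plus independent probability-r reveals of the others — pairs naturally with importance-weighted loss estimates, where each observed loss is divided by the probability of observing that arm. The sketch below is an illustrative Exp3-style learner, not the paper's algorithm: for simplicity it assumes r is known (the paper's whole point is handling unknown r), and the learning-rate tuning merely mimics the O(√((T/r) log N)) regime.

```python
import math
import random

def exp3_side_obs(T, N, r, loss_fn, eta=None):
    """Illustrative Exp3-style learner with probabilistic side observations.

    Each round the learner draws one arm from its weight distribution; every
    non-chosen arm's loss is additionally revealed with probability r. Losses
    are importance-weighted by each arm's observation probability,
    p_i + (1 - p_i) * r, which makes the loss estimates unbiased.

    NOTE: this sketch assumes r is known; the paper's contribution is
    handling a fixed but *unknown* r.
    """
    if eta is None:
        # learning rate loosely tuned to match the sqrt((T/r) log N) scaling
        eta = math.sqrt(math.log(N) * r / T)
    weights = [1.0] * N
    total_loss = 0.0
    for t in range(T):
        z = sum(weights)
        probs = [w / z for w in weights]
        arm = random.choices(range(N), weights=probs)[0]
        losses = [loss_fn(t, i) for i in range(N)]  # adversary's losses in [0, 1]
        total_loss += losses[arm]
        for i in range(N):
            # the chosen arm is always observed; others with probability r
            observed = (i == arm) or (random.random() < r)
            if observed:
                obs_prob = probs[i] + (1.0 - probs[i]) * r
                est = losses[i] / obs_prob  # unbiased loss estimate
            else:
                est = 0.0
            weights[i] *= math.exp(-eta * est)
    return total_loss
```

For example, against a fixed adversary where one arm always has loss 0 and the rest loss 1, the learner's cumulative loss quickly falls well below that of uniform play, consistent with the √(T/r) scaling: more side observations (larger r) mean faster identification of the good arm.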

Abstract

We consider adversarial multi-armed bandit problems where the learner is allowed to observe losses of a number of arms besides the arm that it actually chose. We study the case where all non-chosen arms reveal their loss with a fixed but unknown probability r, independently of each other and the action of the learner. We propose two algorithms that work for different ranges of r. We show that after T rounds in a bandit problem with N arms, the expected regret of our first algorithm is O(\sqrt{(T/r) \log N}) whenever r \ge (\log T)/(2N), while our second algorithm achieves a regret of O(\sqrt{(T/r) \log (N+T)}) for smaller values of r. We also give a quick estimation procedure that decides the range of~r. All our bounds are within logarithmic factors of the best achievable performance of any algorithm that is even allowed to know~r.