COMPASS-Hedge: Learning Safely Without Knowing the World

arXiv cs.LG / 3/25/2026

💬 Opinion | Ideas & Deep Analysis | Models & Research

Key Points

  • The paper introduces COMPASS-Hedge, a new full-information online learning algorithm designed to resolve a common “trilemma” around adversarial regret, stochastic efficiency, and baseline safety relative to a fixed comparator.
  • COMPASS-Hedge is claimed to achieve minimax-optimal regret in adversarial settings, instance-optimal (gap-dependent) regret in stochastic settings, and Õ(1) regret, i.e., constant up to logarithmic factors, versus a designated baseline policy.
  • The method is described as parameter-free, requiring no prior knowledge of whether the environment is adversarial or stochastic and no access to problem-dependent gap magnitudes.
  • The algorithm’s design combines adaptive pseudo-regret scaling with phase-based “aggression,” plus a comparator-aware mixing strategy to unify the three performance guarantees.
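
For orientation, the full-information setting in the points above can be illustrated with classical Hedge (exponential weights), the standard algorithm that COMPASS-Hedge's name refers to. The sketch below is not COMPASS-Hedge: the paper's adaptive pseudo-regret scaling, phase-based aggression, and comparator-aware mixing are not reproduced here. The function name `hedge`, the loss means, and the choice of expert 0 as the baseline are all hypothetical, chosen only for illustration.

```python
import math
import random


def hedge(loss_rounds, eta):
    """Classical Hedge with a fixed learning rate eta (not COMPASS-Hedge).

    loss_rounds: list of per-round loss vectors with entries in [0, 1].
    Returns the learner's cumulative expected loss and each expert's
    cumulative loss.
    """
    k = len(loss_rounds[0])
    cum = [0.0] * k          # cumulative loss of each expert
    learner_loss = 0.0       # cumulative expected loss of the learner
    for losses in loss_rounds:
        # Weights proportional to exp(-eta * cumulative loss); subtracting
        # the minimum keeps the exponentials numerically stable.
        low = min(cum)
        weights = [math.exp(-eta * (c - low)) for c in cum]
        total = sum(weights)
        probs = [w / total for w in weights]
        # Full information: the whole loss vector is observed every round.
        learner_loss += sum(p * l for p, l in zip(probs, losses))
        cum = [c + l for c, l in zip(cum, losses)]
    return learner_loss, cum


# Hypothetical stochastic environment: Bernoulli losses with fixed means.
random.seed(0)
T, K = 2000, 5
means = [0.5, 0.3, 0.6, 0.7, 0.55]
rounds = [[1.0 if random.random() < m else 0.0 for m in means]
          for _ in range(T)]

# Learning rate tuned for the worst case; it requires knowing T in advance.
eta = math.sqrt(8 * math.log(K) / T)
learner_loss, cum = hedge(rounds, eta)

BASELINE = 0  # hypothetical designated baseline expert
regret_to_best = learner_loss - min(cum)
regret_to_baseline = learner_loss - cum[BASELINE]
bound = math.sqrt(T * math.log(K) / 2)  # classical worst-case Hedge bound

print(f"regret vs best expert:   {regret_to_best:.2f}")
print(f"regret vs baseline:      {regret_to_baseline:.2f}")
print(f"worst-case bound:        {bound:.2f}")
```

With this fixed, worst-case learning rate, regret against the baseline expert is controlled only by the same sqrt(T ln K / 2) bound as regret against the best expert, and the tuning depends on knowing T. That is the kind of gap the parameter-free, baseline-safe guarantees described above are meant to close.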

Abstract

Online learning algorithms often face a fundamental trilemma: balancing regret guarantees between adversarial and stochastic settings while providing baseline safety against a fixed comparator. While existing methods excel in one or two of these regimes, they typically fail to unify all three without sacrificing optimal rates or requiring oracle access to problem-dependent parameters. In this work, we bridge this gap by introducing COMPASS-Hedge. Our algorithm is the first full-information method to simultaneously achieve: i) Minimax-optimal regret in adversarial environments; ii) Instance-optimal, gap-dependent regret in stochastic environments; and iii) $\tilde{\mathcal{O}}(1)$ regret relative to a designated baseline policy, up to logarithmic factors. Crucially, COMPASS-Hedge is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic suboptimality gaps. Our approach hinges on a novel integration of adaptive pseudo-regret scaling and phase-based aggression, coupled with a comparator-aware mixing strategy. To the best of our knowledge, this provides the first "best-of-three-worlds" guarantee in the full-information setting, establishing that baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.
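
To make the three regimes concrete, the following shows how such guarantees are typically written in the full-information experts setting. This is a sketch of the standard forms only; the abstract does not state its bounds explicitly, so the paper's precise statements and constants may differ. The symbols K (number of experts), T (horizon), p_t (the learner's distribution), \ell_t (the loss vector), \Delta (the stochastic suboptimality gap), i^* (the best expert in expectation), and b (the baseline) are notation assumed here.

```latex
% Regret of the learner against a fixed expert i after T rounds
R_T(i) \;=\; \sum_{t=1}^{T} \langle p_t, \ell_t \rangle \;-\; \sum_{t=1}^{T} \ell_{t,i}

% (i) Adversarial: minimax-optimal rate for K experts
\max_{i \in [K]} R_T(i) \;=\; \mathcal{O}\!\left(\sqrt{T \log K}\right)

% (ii) Stochastic: a typical gap-dependent form, with
%      \Delta = \min_{i \neq i^*} \left(\mu_i - \mu_{i^*}\right)
\mathbb{E}\!\left[R_T(i^*)\right] \;=\; \mathcal{O}\!\left(\frac{\log K}{\Delta}\right)

% (iii) Baseline safety: constant regret up to logarithmic factors
R_T(b) \;=\; \tilde{\mathcal{O}}(1)
```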