Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?

arXiv cs.LG / 4/9/2026


Key Points

  • The paper studies stochastic multi-objective bandits, asking whether their added Pareto-regret complexity makes them fundamentally harder than single-objective bandits.
  • It shows that, in the stochastic setting, Pareto regret is governed by the maximum sub-optimality gap g^†, yielding a lower bound of order Ω(K log T / g^†).
  • The authors propose a new algorithm that achieves Pareto regret of order O(K log T / g^†), establishing optimality under the paper’s framework.
  • The method uses a nested two-layer uncertainty quantification (upper/lower confidence bounds) over both arm choices and objective dimensions, combining top-two racing with an uncertainty-greedy rule for dimension selection.
  • Numerical experiments are reported to confirm the theoretical regret guarantee and demonstrate substantial improvements over benchmark approaches.
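The quantity g^† above is built from per-arm Pareto sub-optimality gaps. A minimal sketch of how such gaps are commonly computed in the multi-objective bandit literature (the arm means below are illustrative, and the exact gap definition in the paper may differ in detail):

```python
import numpy as np

# Hypothetical mean-reward vectors for K = 4 arms, D = 2 objectives.
mu = np.array([
    [0.9, 0.2],   # arm 0: Pareto-optimal
    [0.2, 0.9],   # arm 1: incomparable with arm 0, also Pareto-optimal
    [0.5, 0.5],   # arm 2: dominated by no arm, also Pareto-optimal
    [0.1, 0.1],   # arm 3: dominated by every other arm
])

def dominates(u, v):
    """u Pareto-dominates v: >= in every objective, > in at least one."""
    return np.all(u >= v) and np.any(u > v)

def gap(i, mu):
    """Smallest uniform lift eps such that mu[i] + eps is dominated by no arm.

    For each arm j dominating arm i, it suffices that arm i catches up in
    the coordinate where it lags least, hence the min over coordinates;
    the gap is the max of these requirements over all dominating arms.
    """
    eps = [np.min(mu[j] - mu[i]) for j in range(len(mu)) if dominates(mu[j], mu[i])]
    return max(eps, default=0.0)

# Pareto front: arms dominated by no other arm.
pareto = [i for i in range(len(mu))
          if not any(dominates(mu[j], mu[i]) for j in range(len(mu)))]
```

Here arms 0, 1, and 2 form the Pareto front and have gap 0, while arm 3 has gap 0.4, determined by arm 2, the dominating arm it trails by the least.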

Abstract

Multi-objective bandits, in which the reward of each arm is a multi-dimensional vector rather than a scalar, have attracted increasing attention because of their broad applicability and mathematical elegance. The vector rewards naturally induce Pareto order relations and a corresponding notion of Pareto regret. A long-standing question in this area is whether this added complexity makes performance fundamentally harder to optimize. A recent surprising result shows that, in the adversarial setting, Pareto regret is no larger than classical regret; in the stochastic setting, however, where the regret notion differs, the picture remains unclear. In fact, existing work suggests that Pareto regret in the stochastic case increases with the dimensionality. This controversial yet subtle phenomenon motivates our central question: are multi-objective bandits actually harder than single-objective ones? We answer this question in full by showing that, in the stochastic setting, Pareto regret is in fact governed by the maximum sub-optimality gap g^†, and hence by the minimum marginal regret of order Ω(K log T / g^†). We further develop a new algorithm that achieves Pareto regret of order O(K log T / g^†), and is therefore optimal. The algorithm leverages a nested two-layer uncertainty quantification over both arms and objectives through upper and lower confidence bound estimators. It combines a top-two racing strategy for arm selection with an uncertainty-greedy rule for dimension selection. Together, these components balance exploration and exploitation across the two layers. We also conduct comprehensive numerical experiments that validate the proposed algorithm, confirming the desired regret guarantee and showing significant gains over benchmark methods.
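The two-layer structure described in the abstract can be sketched as follows. This is a simplified, hypothetical simulation on synthetic Gaussian rewards, not the authors' algorithm: the confidence-radius constant, the tie-breaking rule in the racing step, and the uncertainty proxy for dimension selection are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, T = 5, 3, 5000                     # arms, objectives, horizon (illustrative)
means = rng.uniform(0.2, 0.8, (K, D))    # hidden mean-reward vectors

counts = np.zeros(K, dtype=int)          # pulls per arm
sums = np.zeros((K, D))                  # running reward sums per arm/objective

def pull(a):
    """Observe a noisy D-dimensional reward for arm a and update statistics."""
    r = means[a] + 0.1 * rng.standard_normal(D)
    counts[a] += 1
    sums[a] += r

for a in range(K):                       # initialization: pull every arm once
    pull(a)

for t in range(K, T):
    mu_hat = sums / counts[:, None]
    rad = np.sqrt(2.0 * np.log(t + 1) / counts)[:, None]   # confidence radius
    ucb, lcb = mu_hat + rad, mu_hat - rad                  # two-layer UCB/LCB

    # Uncertainty-greedy dimension selection (illustrative proxy): focus on
    # the objective whose best achievable value is least resolved.
    d = int(np.argmax(ucb.max(axis=0) - lcb.max(axis=0)))

    # Top-two racing on the chosen dimension: race the empirical leader
    # against the most optimistic challenger, balancing their pull counts.
    leader = int(np.argmax(mu_hat[:, d]))
    others = np.where(np.arange(K) == leader, -np.inf, ucb[:, d])
    challenger = int(np.argmax(others))
    a = leader if counts[leader] <= counts[challenger] else challenger
    pull(a)
```

The outer layer (arms) explores via the racing duel, while the inner layer (objectives) directs that duel toward the dimension with the most residual uncertainty, mirroring the exploration/exploitation balance the paper attributes to its two layers.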