Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?

arXiv cs.LG / 4/9/2026


Key Points

  • The paper studies stochastic multi-objective bandits, asking whether their added Pareto-regret complexity makes them fundamentally harder than single-objective bandits.
  • It shows that, in the stochastic setting, Pareto regret is governed by the maximum sub-optimality gap g^†, yielding a lower bound of order Ω(K log T / g^†).
  • The authors propose a new algorithm that achieves Pareto regret of order O(K log T / g^†), establishing optimality under the paper’s framework.
  • The method uses a nested two-layer uncertainty quantification (upper/lower confidence bounds) over both arm choices and objective dimensions, combining top-two racing with an uncertainty-greedy rule for dimension selection.
  • Numerical experiments are reported to confirm the theoretical regret guarantee and demonstrate substantial improvements over benchmark approaches.
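The quantity g^† above is built from per-arm Pareto sub-optimality gaps. A minimal sketch of how such gaps are commonly computed in the multi-objective bandit literature (the arm means below are illustrative, and the exact gap definition in the paper may differ in detail):

```python
import numpy as np

# Hypothetical mean-reward vectors for K = 4 arms, D = 2 objectives.
mu = np.array([
    [0.9, 0.2],   # arm 0: Pareto-optimal
    [0.2, 0.9],   # arm 1: incomparable with arm 0, also Pareto-optimal
    [0.5, 0.5],   # arm 2: dominated by no arm, also Pareto-optimal
    [0.1, 0.1],   # arm 3: dominated by every other arm
])

def dominates(u, v):
    """u Pareto-dominates v: >= in every objective, > in at least one."""
    return np.all(u >= v) and np.any(u > v)

def gap(i, mu):
    """Smallest uniform lift eps such that mu[i] + eps is dominated by no arm.

    For each arm j dominating arm i, it suffices that arm i catches up in
    the coordinate where it lags least, hence the min over coordinates;
    the gap is the max of these requirements over all dominating arms.
    """
    eps = [np.min(mu[j] - mu[i]) for j in range(len(mu)) if dominates(mu[j], mu[i])]
    return max(eps, default=0.0)

# Pareto front: arms dominated by no other arm.
pareto = [i for i in range(len(mu))
          if not any(dominates(mu[j], mu[i]) for j in range(len(mu)))]
```

Here arms 0, 1, and 2 form the Pareto front and have gap 0, while arm 3 has gap 0.4, determined by arm 2, the dominating arm it trails by the least.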

Abstract

Multi-objective bandits, in which the reward of each arm is a multi-dimensional vector rather than a scalar, have attracted increasing attention because of their broad applicability and mathematical elegance. The vector rewards naturally induce Pareto order relations and a corresponding notion of Pareto regret. A long-standing question in this area is whether this added complexity makes performance fundamentally harder to optimize. A recent surprising result shows that, in the adversarial setting, Pareto regret is no larger than classical regret; in the stochastic setting, however, where the regret notion differs, the picture remains unclear. In fact, existing work suggests that Pareto regret in the stochastic case increases with the dimensionality. This controversial yet subtle phenomenon motivates our central question: are multi-objective bandits actually harder than single-objective ones? We answer this question in full by showing that, in the stochastic setting, Pareto regret is in fact governed by the maximum sub-optimality gap g^†, and hence by the minimum marginal regret of order Ω(K log T / g^†). We further develop a new algorithm that achieves Pareto regret of order O(K log T / g^†), and is therefore optimal. The algorithm leverages a nested two-layer uncertainty quantification over both arms and objectives through upper and lower confidence bound estimators. It combines a top-two racing strategy for arm selection with an uncertainty-greedy rule for dimension selection. Together, these components balance exploration and exploitation across the two layers. We also conduct comprehensive numerical experiments that validate the proposed algorithm, confirming the desired regret guarantee and showing significant gains over benchmark methods.
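The two-layer structure described in the abstract can be sketched as follows. This is a simplified, hypothetical simulation on synthetic Gaussian rewards, not the authors' algorithm: the confidence-radius constant, the tie-breaking rule in the racing step, and the uncertainty proxy for dimension selection are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, T = 5, 3, 5000                     # arms, objectives, horizon (illustrative)
means = rng.uniform(0.2, 0.8, (K, D))    # hidden mean-reward vectors

counts = np.zeros(K, dtype=int)          # pulls per arm
sums = np.zeros((K, D))                  # running reward sums per arm/objective

def pull(a):
    """Observe a noisy D-dimensional reward for arm a and update statistics."""
    r = means[a] + 0.1 * rng.standard_normal(D)
    counts[a] += 1
    sums[a] += r

for a in range(K):                       # initialization: pull every arm once
    pull(a)

for t in range(K, T):
    mu_hat = sums / counts[:, None]
    rad = np.sqrt(2.0 * np.log(t + 1) / counts)[:, None]   # confidence radius
    ucb, lcb = mu_hat + rad, mu_hat - rad                  # two-layer UCB/LCB

    # Uncertainty-greedy dimension selection (illustrative proxy): focus on
    # the objective whose best achievable value is least resolved.
    d = int(np.argmax(ucb.max(axis=0) - lcb.max(axis=0)))

    # Top-two racing on the chosen dimension: race the empirical leader
    # against the most optimistic challenger, balancing their pull counts.
    leader = int(np.argmax(mu_hat[:, d]))
    others = np.where(np.arange(K) == leader, -np.inf, ucb[:, d])
    challenger = int(np.argmax(others))
    a = leader if counts[leader] <= counts[challenger] else challenger
    pull(a)
```

The outer layer (arms) explores via the racing duel, while the inner layer (objectives) directs that duel toward the dimension with the most residual uncertainty, mirroring the exploration/exploitation balance the paper attributes to its two layers.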