Fast Best-in-Class Regret for Contextual Bandits

arXiv stat.ML / 4/6/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

共有:

Key Points

論文は、確率的文脈バンディットの「エイゴスティック（非実現可能）」設定で、損失・報酬に制約を置かずにクラス内の最良方策（best-in-class）に対する後悔（regret）を達成する問題を扱っています。
すべてのラウンドで悲観的（pessimistic）目的関数を最小化して方策を更新し、その目的はクリップ付き逆確率重み付け（clipped inverse-propensity estimate）の方策価値に分散ペナルティを加えた形になっています。
方策クラスに対するエントロピー仮定と、マージン条件の一般化であるHölder型の誤差境界条件を用いることで、best-in-classに対する「最初の高速率（first fast rate）」の後悔保証を示しています。
分析では、適応的なデータ収集下でも悲観性を保証するために、bounded martingale empirical processesに対する逐次自己正規化の最大不等式を用いた一様に分散適応的な信頼区間を構築しています。

Abstract

We study the problem of stochastic contextual bandits in the agnostic setting, where the goal is to compete with the best policy in a given class without assuming realizability or imposing model restrictions on losses or rewards. In this work, we establish the first fast rate for regret relative to the best-in-class policy. Our proposed algorithm updates the policy at every round by minimizing a pessimistic objective, defined as a clipped inverse-propensity estimate of the policy value plus a variance penalty. By leveraging entropy assumptions on the policy class and a H\"olderian error-bound condition (a generalization of the margin condition), we achieve fast best-in-class regret rates, including polylogarithmic rates in the parametric case. The analysis is driven by a sequential self-normalized maximal inequality for bounded martingale empirical processes, which yields uniform variance-adaptive confidence bounds and guarantees pessimism under adaptive data collection.