Abstract
Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains far from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of \tilde{O}(\eta SAC^{\pi^*}/\epsilon) under large regularization \eta = \tilde{O}(\epsilon^{-1}), and a sample complexity of \tilde{O}(SAC^{\pi^*}/\epsilon^2) under small regularization \eta = \tilde{\Omega}(\epsilon^{-1}), where \eta is the regularization parameter, S is the number of contexts, A is the number of arms, C^{\pi^*} is the policy coverage coefficient at the optimal policy \pi^*, \epsilon is the desired sub-optimality, and \tilde{O} and \tilde{\Omega} hide all poly-logarithmic factors. We further provide a pair of sharper sample complexity lower bounds, which match the upper bounds over the entire range of regularization strengths. Overall, our results provide a nearly complete characterization of offline multi-armed bandits with KL regularization.
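For readability, the two regimes above can be folded into a single expression; the notation N(\epsilon, \eta) for the sample complexity is ours, introduced only for this summary:
\[
N(\epsilon, \eta) \;=\; \tilde{O}\!\left(\min\{\eta,\ \epsilon^{-1}\} \cdot \frac{S A\, C^{\pi^*}}{\epsilon}\right),
\]
which recovers \tilde{O}(\eta SAC^{\pi^*}/\epsilon) when \eta = \tilde{O}(\epsilon^{-1}) and \tilde{O}(SAC^{\pi^*}/\epsilon^2) when \eta = \tilde{\Omega}(\epsilon^{-1}).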