Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards

arXiv stat.ML / 4/23/2026


Key Points

  • The paper studies a new multi-armed bandit setting where offline side information is transformed into surrogate rewards using pre-trained machine learning models, addressing the problem that online reward data is often scarce.
  • It introduces the ML-Assisted UCB (MLA-UCB) algorithm, which combines predicted rewards with uncertainty quantification to control the bias that surrogate rewards inherit from offline extrapolation.
  • Under a joint Gaussian assumption for predicted and true rewards, the method is proven to improve cumulative regret and remain asymptotically optimal even when the surrogate mean is misaligned with the true mean.
  • The approach does not require prior knowledge of the covariance between true and surrogate rewards, and the paper extends it to batched bandits with potentially non-Gaussian rewards, providing computable confidence bounds and regret guarantees.
  • Experiments—including simulations and real-world applications like language model selection and video recommendation—show consistent regret reductions with moderate surrogate sample sizes and correlations.
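The core idea in the points above can be illustrated with a small simulation. The sketch below is not the paper's MLA-UCB estimator (which is derived under a joint Gaussian model with an estimated covariance matrix); it uses a simpler control-variate style adjustment to show the same mechanism: biased offline surrogate rewards still reduce regret, because the systematic bias cancels when the online surrogate mean is compared against the offline one. The arm means, correlation, bias, and sample sizes are all invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 3, 2000
true_means = np.array([0.0, 0.5, 1.0])   # invented arm means for the demo
rho, bias = 0.7, 0.4                     # surrogate correlation and systematic bias

# Offline phase: only biased surrogate rewards are available, no true rewards.
off_mean = np.array([true_means[a] + bias + rng.normal(0, 1, 500).mean()
                     for a in range(K)])

def draw(a):
    """One arm pull: a correlated (true reward, surrogate prediction) pair."""
    e1, e2 = rng.normal(size=2)
    y = true_means[a] + e1
    f = true_means[a] + bias + rho * e1 + np.sqrt(1 - rho**2) * e2
    return y, f

def combined_mean(y, f, f_off):
    """Control-variate adjustment of the online mean (illustrative only).
    The surrogate bias appears in both f.mean() and f_off, so it cancels."""
    y, f = np.asarray(y), np.asarray(f)
    C = np.cov(y, f)                      # 2x2 sample covariance
    if len(y) < 2 or C[1, 1] == 0:
        return y.mean()
    beta = C[0, 1] / C[1, 1]
    return y.mean() - beta * (f.mean() - f_off)

# Online phase: UCB over the variance-reduced mean estimates.
ys = [[] for _ in range(K)]
fs = [[] for _ in range(K)]
for a in range(K):                        # a few initial pulls per arm to
    for _ in range(10):                   # stabilize the beta estimate
        y, f = draw(a)
        ys[a].append(y); fs[a].append(f)

for t in range(10 * K, T):
    idx = [combined_mean(ys[a], fs[a], off_mean[a])
           + np.sqrt(2 * np.log(t) / len(ys[a])) for a in range(K)]
    a = int(np.argmax(idx))
    y, f = draw(a)
    ys[a].append(y); fs[a].append(f)

pulls = np.array([len(ys[a]) for a in range(K)])
print(pulls)                              # the best arm should dominate
```

With correlation 0.7 the adjusted estimator has roughly half the variance of the plain online mean, so suboptimal arms are abandoned sooner than under classical UCB; note that the offset `bias` never needs to be known or estimated.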

Abstract

Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely solely on online data, which tends to be scarce as it must be gathered during the online phase when the arms are actively pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available prior to deploying any arms. We introduce a new setting for MAB where pre-trained machine learning (ML) models are applied to convert side information and historical data into surrogate rewards. A prominent challenge of this setting is that the surrogate rewards may exhibit substantial bias, as true reward data is typically unavailable in the offline phase, forcing ML predictions to rely heavily on extrapolation. To address this issue, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB) algorithm, which can be applied to any reward prediction model and any form of auxiliary data. When the predicted and true rewards are jointly Gaussian, it provably improves the cumulative regret, even in cases where the mean surrogate reward completely misaligns with the true mean reward, and achieves asymptotic optimality among a broad class of policies. Notably, our method requires no prior knowledge of the covariance matrix between true and surrogate rewards. We further extend the method to a batched-reward MAB problem, where each arm pull yields a batch of observations and rewards may be non-Gaussian, and we derive computable confidence bounds and regret guarantees that improve upon classical UCB algorithms. Finally, extensive simulations with both Gaussian and ML-generated surrogates, together with real-world studies on language model selection and video recommendation, demonstrate consistent and often substantial regret reductions with moderate offline surrogate sample sizes and correlations.