Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards
arXiv stat.ML / 4/23/2026
Key Points
- The paper studies a new multi-armed bandit setting where offline side information is transformed into surrogate rewards using pre-trained machine learning models, addressing the problem that online reward data is often scarce.
- It introduces the ML-assisted UCB (MLA-UCB) algorithm, which combines predicted rewards with uncertainty quantification to control the bias that surrogate rewards inherit from offline extrapolation.
- Under a joint Gaussian assumption for predicted and true rewards, the method is proven to improve cumulative regret and remain asymptotically optimal even when the surrogate mean is misaligned with the true mean.
- The approach does not require prior knowledge of the covariance between true and surrogate rewards, and the paper extends it to batched bandits with potentially non-Gaussian rewards, providing computable confidence bounds and regret guarantees.
- Experiments—including simulations and real-world applications like language model selection and video recommendation—show consistent regret reductions with moderate surrogate sample sizes and correlations.
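The key points above describe combining scarce online rewards with biased ML-predicted surrogates inside a UCB index. The sketch below is a hypothetical illustration of that general idea, not the paper's MLA-UCB estimator: it shrinks each arm's online sample mean toward a fixed surrogate prediction (with a made-up shrinkage weight `w = n/(n+10)`) and adds a standard UCB exploration bonus. The arm means, the constant surrogate bias, and the shrinkage rule are all assumptions for illustration.

```python
import math
import random

random.seed(0)

K = 3
true_means = [0.2, 0.5, 0.4]
# Surrogate (offline ML) predictions: biased estimates of the true means.
# The constant +0.1 bias is an assumption for this toy example.
surrogate = [m + 0.1 for m in true_means]

T = 2000
counts = [0] * K     # pulls per arm
sums = [0.0] * K     # cumulative online reward per arm

def pull(arm):
    """Draw a noisy online reward for the chosen arm."""
    return random.gauss(true_means[arm], 1.0)

regret = 0.0
best = max(true_means)
for t in range(1, T + 1):
    if t <= K:
        arm = t - 1  # initialization: pull each arm once
    else:
        ucbs = []
        for a in range(K):
            n = counts[a]
            online_mean = sums[a] / n
            # Shrink the online mean toward the surrogate prediction.
            # The weight schedule is a hypothetical tuning choice; the paper
            # instead estimates the true/surrogate covariance online.
            w = n / (n + 10.0)
            est = w * online_mean + (1 - w) * surrogate[a]
            bonus = math.sqrt(2 * math.log(t) / n)  # UCB exploration bonus
            ucbs.append(est + bonus)
        arm = max(range(K), key=lambda a: ucbs[a])
    reward = pull(arm)
    counts[arm] += 1
    sums[arm] += reward
    regret += best - true_means[arm]

print(f"pulls per arm: {counts}, cumulative regret: {regret:.1f}")
```

As the online sample count for an arm grows, the weight on the biased surrogate shrinks to zero, which is the intuition behind the paper's claim that MLA-UCB stays asymptotically optimal even when the surrogate mean is misaligned with the true mean.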