Concave Statistical Utility Maximization Bandits via Influence-Function Gradients

arXiv cs.LG / 4/27/2026



Key Points

The paper studies stochastic multi-armed bandits where the goal is to maximize a concave utility of the long-run reward distribution (a statistical functional) rather than expected reward alone.
It shows that, under mild continuity assumptions, the infinite-horizon bandit problem reduces to optimizing over stationary mixed policies, each parameterized by a weight vector on the simplex that induces a mixture law over rewards.
For differentiable concave utilities, the authors derive stochastic gradient estimators from bandit feedback using influence-function calculus.
They propose an entropic mirror-ascent algorithm on a truncated simplex, implemented through multiplicative-weights updates and plug-in influence-function estimates, and prove regret bounds that separate the mirror-ascent optimization error from the bias introduced by estimating the influence function.
The method is applied to general concave distributional utilities, including variance and Wasserstein objectives, with experiments comparing exact versus plug-in influence-function implementations.
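
To make the influence-function gradient concrete, the block below writes out the standard first-variation identity for a mixture law and specializes it to the variance. The notation is an assumption for illustration and is not taken from the paper: \(K\) arms, \(P_i\) for the reward law of arm \(i\) with mean \(\mu_i\) and variance \(\sigma_i^2\), \(P^w=\sum_i w_i P_i\) with mean \(\mu_w\), and \(\psi(\cdot;P)\) for the influence function of the utility at \(P\). The identity holds up to an additive constant shared by all arms, which does not matter for directions tangent to the simplex.

```latex
\[
  P^w = \sum_{i=1}^{K} w_i P_i ,
  \qquad
  U(w) = \mathfrak U(P^w),
  \qquad
  \frac{\partial U}{\partial w_i}(w) = \int \psi(x; P^w)\, dP_i(x) .
\]
% Example: the variance functional, which is concave in P.
\[
  \mathfrak U(P) = \operatorname{Var}_P(X),
  \qquad
  \psi(x; P) = (x - \mu_P)^2 - \operatorname{Var}_P(X),
\]
\[
  \int \psi(x; P^w)\, dP_i(x)
  = \sigma_i^2 + (\mu_i - \mu_w)^2 - \operatorname{Var}_{P^w}(X) .
\]
```

Under these assumptions, pulling arm \(i\) and evaluating \(\psi\) at the observed reward gives a noisy sample of the \(i\)-th gradient coordinate, which is what makes gradient estimation from bandit feedback plausible.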

Abstract

We study stochastic multi-armed bandits in which the objective is a statistical functional of the long-run reward distribution, rather than expected reward alone. Under mild continuity assumptions, we show that the infinite-horizon problem reduces to optimizing over stationary mixed policies: each weight vector \(w\) on the simplex induces a mixture law \(P^w\), and performance is measured by the concave utility \(U(w)=\mathfrak U(P^w)\). For differentiable statistical utilities, we use influence-function calculus to derive stochastic gradient estimators from bandit feedback. This leads to an entropic mirror-ascent algorithm on a truncated simplex, implemented through multiplicative-weights updates and plug-in estimates of the influence function. We establish regret bounds that separate the mirror-ascent optimization error from the bias caused by estimating the influence function. The framework is developed for general concave distributional utilities and illustrated through variance and Wasserstein objectives, with numerical experiments comparing exact and plug-in influence-function implementations.