Concave Statistical Utility Maximization Bandits via Influence-Function Gradients
arXiv cs.LG / 4/27/2026
Key Points
- The paper studies stochastic multi-armed bandits where the goal is to maximize a statistical functional of the long-run reward distribution (via a concave utility) rather than just maximizing expected reward.
- It shows that, under mild continuity assumptions, the infinite-horizon bandit problem can be reformulated as optimizing a stationary mixture over policies parameterized by weights on the simplex.
- For differentiable concave utilities, the authors derive stochastic gradient estimators from bandit feedback using influence-function calculus.
- They propose an entropic mirror-ascent algorithm on a truncated simplex with multiplicative-weights updates, and they analyze regret bounds separating optimization error from bias due to influence-function estimation.
- The method is applied to general concave distributional utilities, including variance and Wasserstein objectives, with experiments comparing exact versus plug-in influence-function implementations.
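The algorithm sketched in the bullets above can be illustrated with a small simulation. This is a minimal sketch, not the paper's implementation: the bandit instance (3 Gaussian arms), the mean-minus-variance utility, the step size, the truncation level, and the clipping of the gradient estimate are all illustrative assumptions. It combines a plug-in influence-function gradient estimate with a multiplicative-weights (entropic mirror-ascent) update on a truncated simplex.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed Gaussian bandit (means/stds are illustrative, not from the paper).
means = np.array([0.4, 0.6, 0.5])
stds = np.array([0.6, 0.9, 0.2])
K = len(means)

lam = 1.0   # risk-aversion weight in the concave utility U(F) = mean(F) - lam * var(F)
eta = 0.05  # mirror-ascent step size (assumed constant here; the paper tunes this)
eps = 0.01  # simplex truncation: keep every weight bounded away from zero
T = 5000

w = np.ones(K) / K              # mixture weights over arms, on the simplex
n = np.zeros(K)                 # pulls per arm
s1 = np.zeros(K)                # running sum of rewards per arm
s2 = np.zeros(K)                # running sum of squared rewards per arm

def influence(x, mu, var):
    # Influence function of U(F) = mean(F) - lam * var(F) at point x:
    #   IF(x) = (x - mu) - lam * ((x - mu)**2 - var)
    return (x - mu) - lam * ((x - mu) ** 2 - var)

for t in range(T):
    arm = rng.choice(K, p=w)
    x = rng.normal(means[arm], stds[arm])
    n[arm] += 1
    s1[arm] += x
    s2[arm] += x * x

    # Plug-in estimates of the mixture mean/variance under the current weights.
    arm_mean = np.where(n > 0, s1 / np.maximum(n, 1), 0.0)
    arm_2nd = np.where(n > 0, s2 / np.maximum(n, 1), 0.0)
    mu = w @ arm_mean
    var = max(w @ arm_2nd - mu ** 2, 1e-12)

    # Importance-weighted gradient estimate from bandit feedback:
    # only the pulled arm's coordinate is updated, scaled by 1 / w[arm].
    g = np.zeros(K)
    g[arm] = influence(x, mu, var) / w[arm]

    # Entropic mirror ascent = multiplicative-weights update
    # (gradient clipped for numerical stability in this sketch),
    w = w * np.exp(eta * np.clip(g, -10.0, 10.0))
    w /= w.sum()
    # followed by a simple clip-and-renormalize heuristic standing in
    # for exact projection onto the truncated simplex {w : w_k >= eps}.
    w = np.maximum(w, eps)
    w /= w.sum()
```

Under this mean-variance utility, the low-variance arm tends to accumulate weight even though it does not have the highest mean, which is exactly the behavior a purely expected-reward bandit would miss.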