Thompson Sampling for Infinite-Horizon Discounted Decision Processes
arXiv stat.ML / 4/9/2026
Key Points
- The paper studies learning in discounted infinite-horizon MDPs with Borel (possibly continuous) state and action spaces where rewards and transitions depend on an unknown parameter.
- It introduces a canonical probability space that supports rigorous analysis of adaptive sampling-based algorithms, addressing the difficulty of even defining learning in this setting.
- Because standard regret notions do not directly fit infinite-horizon policy evaluation, the authors propose decomposed metrics that split regret into expected finite-time regret, expected state regret, and expected residual regret.
- Focusing on Thompson sampling, the paper proves that the residual regret term converges to zero exponentially fast under assumptions extending prior finite-space results to the Borel setting.
- It further shows almost-sure convergence of a probabilistic residual regret variant and concludes that Thompson sampling achieves complete learning in the model.
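The paper works in general Borel state and action spaces, but the core Thompson sampling loop it analyzes can be illustrated in the familiar finite-space special case: maintain a posterior over the unknown parameter, sample a parameter, act optimally for the sampled model, and update the posterior. The sketch below assumes a finite MDP with known transitions and unknown Bernoulli reward means under Beta priors; all names (`value_iteration`, `thompson_sampling`) and modeling choices are illustrative, not taken from the paper.

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-8):
    """Optimal Q-values for a finite discounted MDP.
    P[s, a, s'] are transition probabilities, r[s, a] mean rewards."""
    S, A = r.shape
    Q = np.zeros((S, A))
    while True:
        V = Q.max(axis=1)                  # greedy value per state
        Q_new = r + gamma * P @ V          # Bellman optimality backup
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

def thompson_sampling(P, true_r, horizon=2000, gamma=0.9, seed=0):
    """Beta-Bernoulli Thompson sampling: at each step, sample reward
    means from the posterior, solve the sampled MDP, act greedily."""
    rng = np.random.default_rng(seed)
    S, A = true_r.shape
    alpha = np.ones((S, A))                # Beta(1, 1) priors per (s, a)
    beta = np.ones((S, A))
    s = 0
    for _ in range(horizon):
        r_hat = rng.beta(alpha, beta)      # posterior sample of reward means
        Q = value_iteration(P, r_hat, gamma)
        a = int(Q[s].argmax())             # optimal action for sampled model
        reward = rng.random() < true_r[s, a]  # Bernoulli reward draw
        alpha[s, a] += reward              # conjugate posterior update
        beta[s, a] += 1 - reward
        s = rng.choice(S, p=P[s, a])       # transition to next state
    return alpha / (alpha + beta)          # posterior mean reward estimates
```

On a small MDP with a clear reward gap, the posterior means concentrate on the true values for the actions the sampled-greedy policy keeps selecting, which is the finite-space intuition behind the paper's "complete learning" conclusion.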