Thompson Sampling for Infinite-Horizon Discounted Decision Processes

arXiv stat.ML / 4/9/2026


Key Points

  • The paper studies learning in discounted infinite-horizon MDPs with Borel (possibly continuous) state and action spaces where rewards and transitions depend on an unknown parameter.
  • It introduces a canonical probability space to support analysis of adaptive sampling-based algorithms, addressing the challenges of defining learning for such settings.
  • Because standard regret notions do not directly fit infinite-horizon policy evaluation, the authors propose decomposed metrics that split regret into expected finite-time regret, expected state regret, and expected residual regret (sketched below, after this list).
  • Focusing on Thompson sampling, the paper proves that the residual regret term converges to zero exponentially fast under assumptions extending prior finite-space results to the Borel setting.
  • It further shows almost-sure convergence of a probabilistic residual regret variant and concludes that Thompson sampling achieves complete learning in the model.
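
For orientation, here is a rough sketch of the decomposition referenced above. The notation is ours, not the paper's: $\gamma$ is the discount factor, $T$ the split time, $(s_t^*, a_t^*)$ the state-action path under an optimal policy, $(s_t, a_t)$ the path under the learning algorithm, and $V^*$ the optimal value function for the true parameter.

```latex
% Informal three-term split of the expected regret (our notation, not the paper's):
\begin{align*}
\mathcal{R}
 &= \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty}\gamma^{t} r(s_t^{*},a_t^{*})
     \;-\; \sum_{t=0}^{\infty}\gamma^{t} r(s_t,a_t)\Big] \\
 &= \underbrace{\mathbb{E}\Big[\textstyle\sum_{t=0}^{T-1}\gamma^{t}
      \big(r(s_t^{*},a_t^{*}) - r(s_t,a_t)\big)\Big]}_{\text{(i) expected finite-time regret}}
  \;+\; \underbrace{\gamma^{T}\,\mathbb{E}\big[V^{*}(s_T^{*}) - V^{*}(s_T)\big]}_{\text{(ii) expected state regret}} \\
 &\quad\;+\; \underbrace{\gamma^{T}\,\mathbb{E}\Big[V^{*}(s_T)
      - \textstyle\sum_{t=T}^{\infty}\gamma^{\,t-T} r(s_t,a_t)\Big]}_{\text{(iii) expected residual regret}}
\end{align*}
```

The three terms telescope back to the full expected regret, and term (iii) is the one shown to vanish exponentially fast for Thompson sampling.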

Abstract

This paper develops a viable notion of learning for sampling-based algorithms that applies in broader settings than previously considered. More specifically, we model a discounted infinite-horizon MDP with Borel state and action spaces, whose rewards and transitions depend on an unknown parameter. To analyze adaptive learning algorithms based on sampling, we introduce a general canonical probability space for this setting. Since standard definitions of regret are inadequate for policy evaluation here, we propose new metrics that arise from decomposing the standard expected regret in discounted infinite-horizon MDPs into three terms: (i) the expected finite-time regret, (ii) the expected state regret, and (iii) the expected residual regret. Component (i) translates into the traditional concept of expected regret over a finite horizon. Term (ii) reflects how much future performance is compromised at a given time because earlier decisions have led the system to a less favorable state than under an optimal policy. Finally, metric (iii) measures regret with respect to the optimal reward from the current period onward, disregarding the irreversible consequences of past decisions. We further disaggregate this term by introducing the probabilistic residual regret, a finer, sample-path version of (iii) that captures the remaining loss in future performance from the current period onward, conditional on the observed history. Its expectation coincides with (iii). We then focus on Thompson sampling (TS); under assumptions that extend those used in prior work on finite state and action spaces to the Borel setting, we show that component (iii) for TS converges to zero exponentially fast. We further show that, under mild conditions ensuring the existence of the relevant limits, its probabilistic counterpart converges to zero almost surely, and TS achieves complete learning.
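
To make the Thompson sampling loop concrete, here is a minimal, self-contained sketch on a toy finite MDP. This is our simplification, not the paper's Borel-space model: the unknown parameter indexes one of two candidate transition kernels, the posterior is a discrete Bayes update, and each period the agent re-plans with a freshly sampled parameter. All names (`value_iteration`, `belief`, etc.) are hypothetical.

```python
# Toy Thompson sampling for a discounted MDP with an unknown transition
# kernel (our illustrative setup, not the paper's Borel-space model).
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9                      # discount factor
S, A = 3, 2                      # small finite spaces for illustration
R = rng.uniform(size=(S, A))     # known rewards r(s, a)

def random_kernel():
    """Random transition kernel P[a, s, s'] with rows summing to one."""
    P = rng.uniform(size=(A, S, S))
    return P / P.sum(axis=2, keepdims=True)

thetas = [random_kernel(), random_kernel()]   # candidate parameters
true_P = thetas[0]                            # unknown to the learner
belief = np.array([0.5, 0.5])                 # prior over the candidates

def value_iteration(P, tol=1e-8):
    """Optimal policy for the MDP induced by kernel P."""
    V = np.zeros(S)
    while True:
        Q = R + gamma * np.einsum("asj,j->sa", P, V)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return Q.argmax(axis=1)
        V = V_new

s = 0
for t in range(200):
    k = rng.choice(len(thetas), p=belief)     # TS: sample a parameter
    policy = value_iteration(thetas[k])       # act optimally for the sample
    a = policy[s]
    s_next = rng.choice(S, p=true_P[a, s])    # environment transition
    # Bayes update of the posterior from the observed transition.
    lik = np.array([th[a, s, s_next] for th in thetas])
    belief = belief * lik
    belief /= belief.sum()
    s = s_next

print("posterior over candidate parameters:", np.round(belief, 3))
```

As observations accumulate, the posterior concentrates on the true kernel, illustrating in miniature the complete-learning behavior the paper formalizes for the Borel setting.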