Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes

arXiv stat.ML / 5/6/2026


Key Points

  • The paper studies $(\varepsilon, \delta)$-PAC policy identification in finite-horizon episodic tabular Markov Decision Processes, focusing on both statistical (sample complexity) and computational efficiency.
  • It criticizes existing finite-time approaches for being computationally expensive and for having suboptimal dependence on $\log(1/\delta)$, which limits both their practicality and their theoretical tightness.
  • The authors propose a randomized, computationally efficient best-policy identification algorithm that combines posterior sampling with an online learning strategy to drive exploration in the MDP.
  • The method is shown to be asymptotically optimal in sample complexity, including alignment with posterior contraction rates, and it achieves a per-episode runtime of $O(S^2AH)$.
  • Compared to prior methods such as MOCA and PEDEL, the new guarantees remain meaningful in the asymptotic regime and avoid unfavorable polynomial dependence on $\log(1/\delta)$, aiming to be both insightful and practically usable for tabular MDPs.
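To make the posterior-sampling ingredient concrete, here is a minimal sketch of the model-based core such an algorithm typically relies on: sample a transition model from a Dirichlet posterior over each $(h, s, a)$ triple, then compute the sampled MDP's optimal policy by backward induction, whose cost is $O(S^2AH)$ per episode. This is an illustration under assumed conventions (uniform Dirichlet prior, known rewards), not the paper's algorithm; in particular, the online-learning component that steers exploration is omitted.

```python
import numpy as np

def sample_mdp_posterior(counts, rng):
    """Sample a transition kernel P[h, s, a, s'] from a Dirichlet posterior.

    counts[h, s, a, s'] holds observed transition counts; a uniform
    Dirichlet(1) prior is assumed here for illustration.
    """
    H, S, A, _ = counts.shape
    P = np.empty((H, S, A, S))
    for h in range(H):
        for s in range(S):
            for a in range(A):
                P[h, s, a] = rng.dirichlet(1.0 + counts[h, s, a])
    return P

def backward_induction(P, R):
    """Optimal policy of the sampled finite-horizon MDP in O(S^2 A H) time.

    P has shape (H, S, A, S); R has shape (H, S, A) with known rewards.
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))          # V[H] = 0 terminal values
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        # Q[s, a] = R[h, s, a] + sum_{s'} P[h, s, a, s'] * V[h + 1, s']
        Q = R[h] + P[h] @ V[h + 1]
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return pi, V
```

Each episode, the agent would sample a model with `sample_mdp_posterior`, act according to (a possibly perturbed version of) the policy returned by `backward_induction`, and update the counts from the observed transitions; the dominant cost is the `P[h] @ V[h + 1]` step, giving the $O(S^2AH)$ per-episode runtime quoted above.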

Abstract

We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon > 0$) but suffer from high computational cost, rendering them hard to implement, and from suboptimal dependence on $\log(1/\delta)$. We propose a randomized and computationally efficient algorithm for best policy identification that combines posterior sampling with an online learning algorithm to guide exploration in the MDP. Our method achieves asymptotic optimality in sample complexity, also in terms of posterior contraction rate, and runs in $O(S^2AH)$ per episode, matching standard model-based approaches. Unlike prior algorithms such as MOCA and PEDEL, our guarantees remain meaningful in the asymptotic regime and avoid sub-optimal polynomial dependence on $\log(1/\delta)$. Our results provide both theoretical insights and practical tools for efficient policy identification in tabular MDPs.