Model-Based Learning of Near-Optimal Finite-Window Policies in POMDPs

arXiv cs.LG / 4/2/2026


Key Points

  • The paper studies how to learn finite-window policies for tabular POMDPs using a model-based approach that converts finite history windows into a “superstate MDP.”
  • It argues that standard MDP planning becomes possible once a model of the superstate MDP is estimated, but emphasizes that collecting data from the original POMDP creates a sampling–target mismatch.
  • The authors propose a model estimation procedure for tabular POMDPs and provide a sample-complexity analysis for estimating the superstate MDP model from a single trajectory.
  • The analysis leverages a link between filter stability and concentration bounds for weakly dependent random variables to obtain tight guarantees.
  • Using value iteration on the learned superstate model, the method produces approximately optimal finite-window policies for the original POMDP.
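To make the superstate idea concrete, here is a minimal sketch of the two estimation steps the bullets describe: form superstates as tuples of the last few action-observation pairs, then estimate the superstate MDP's transition and reward model by counting along a single trajectory. The function name, the `(action, observation, reward)` trajectory format, and the window length are illustrative assumptions, not the paper's notation.

```python
from collections import defaultdict

def estimate_superstate_model(trajectory, window=2):
    """Count-based estimate of a superstate MDP from one POMDP rollout.

    `trajectory` is a list of (action, observation, reward) tuples; a
    superstate is the tuple of the last `window` (action, observation)
    pairs. Names and signature are illustrative, not the paper's.
    """
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> s' -> count
    reward_sum = defaultdict(float)                  # (s, a) -> summed reward
    visits = defaultdict(int)                        # (s, a) -> visit count

    for t in range(window, len(trajectory)):
        # Current superstate: the last `window` action-observation pairs.
        s = tuple((a, o) for a, o, _ in trajectory[t - window:t])
        a_t, _, r_t = trajectory[t]
        # Next superstate: the window shifted forward by one step.
        s_next = tuple((a, o) for a, o, _ in trajectory[t - window + 1:t + 1])
        counts[(s, a_t)][s_next] += 1
        reward_sum[(s, a_t)] += r_t
        visits[(s, a_t)] += 1

    # Normalize counts into empirical transition probabilities and mean rewards.
    P = {sa: {sp: c / visits[sa] for sp, c in nxt.items()}
         for sa, nxt in counts.items()}
    R = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P, R
```

The sampling-target mismatch the authors highlight shows up here: consecutive superstates overlap in `window - 1` entries, so the counted transitions are not i.i.d. samples from the superstate MDP, which is why their analysis needs concentration bounds for weakly dependent variables.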

Abstract

We study model-based learning of finite-window policies in tabular partially observable Markov decision processes (POMDPs). A common approach to learning under partial observability is to approximate unbounded history dependencies using finite action-observation windows. This induces a finite-state Markov decision process (MDP) over histories, referred to as the superstate MDP. Once a model of this superstate MDP is available, standard MDP algorithms can be used to compute optimal policies, motivating the need for sample-efficient model estimation. Estimating the superstate MDP model is challenging because trajectories are generated by interaction with the original POMDP, creating a mismatch between the sampling process and target model. We propose a model estimation procedure for tabular POMDPs and analyze its sample complexity. Our analysis exploits a connection between filter stability and concentration inequalities for weakly dependent random variables. As a result, we obtain tight sample complexity guarantees for estimating the superstate MDP model from a single trajectory. Combined with value iteration, this yields approximately optimal finite-window policies for the POMDP.
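The planning step the abstract mentions is standard once the model is in hand: run value iteration on the estimated superstate MDP and read off a greedy policy, which is exactly a finite-window policy since each superstate is a history window. The sketch below assumes the dictionary model format from a count-based estimator (`P[(s, a)]` mapping next superstates to probabilities, `R[(s, a)]` a mean reward); the function name and defaults are hypothetical.

```python
def value_iterate(P, R, actions, gamma=0.95, tol=1e-8):
    """Value iteration on an estimated superstate MDP.

    Returns state values and the greedy policy, which maps each finite
    action-observation window (superstate) to an action. A sketch under
    assumed model format, not the paper's exact algorithm.
    """
    states = {s for s, _ in P}
    V = {s: 0.0 for s in states}

    def q_value(s, a):
        # One-step lookahead; unseen (s, a) pairs default to zero.
        return R.get((s, a), 0.0) + gamma * sum(
            p * V.get(sp, 0.0) for sp, p in P.get((s, a), {}).items())

    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break

    policy = {s: max(actions, key=lambda a: q_value(s, a)) for s in states}
    return V, policy
```

Because the model `(P, R)` is only an estimate, the greedy policy is approximately optimal; the paper's sample-complexity bounds control how far the estimated model, and hence this policy, can be from the true superstate MDP optimum.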