Policy Testing in Markov Decision Processes

arXiv stat.ML / 4/21/2026


Key Points

  • The paper studies “policy testing” in discounted Markov decision processes (MDPs), aiming to determine whether a given policy’s value exceeds a threshold with high confidence while using as few samples as possible.
  • It establishes an instance-dependent lower bound for any reasonable algorithm, expressed as an optimization problem with non-convex constraints.
  • The authors propose a new algorithm based on reformulating the lower-bound problem by exchanging the roles of the objective and constraints, producing a problem with a non-convex objective but convex constraints.
  • This reformulation is interpreted as a policy optimization task in a newly defined “reversed MDP,” and the paper shows that the global KL constraint can be decomposed exactly into a family of product-box subproblems, each solved by projected policy gradient and combined through an outer budget search.
  • The work suggests that the reversed-MDP perspective and reformulation could extend to other pure-exploration problems in MDPs, such as policy evaluation and best-policy identification.
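To make the last computational step above concrete: a "product-box" constraint set is a Cartesian product of intervals, so Euclidean projection onto it is just coordinate-wise clipping, which makes projected gradient methods cheap per iteration. The sketch below is not the paper's algorithm or objective; it is a minimal, self-contained illustration of projected gradient descent on an invented non-convex function over a box, the kind of subproblem solver the summary refers to.

```python
import numpy as np

def project_box(x, lo, hi):
    """Euclidean projection onto the product box [lo, hi]^d is
    coordinate-wise clipping."""
    return np.clip(x, lo, hi)

def projected_gradient_descent(grad, x0, lo, hi, step=0.05, iters=500):
    """Minimize a smooth (possibly non-convex) objective over a product box:
    take a gradient step, then project back onto the box."""
    x = x0.copy()
    for _ in range(iters):
        x = project_box(x - step * grad(x), lo, hi)
    return x

# Toy non-convex objective (invented for illustration):
# f(x) = sum_i sin(3 x_i) + x_i^2, with gradient 3 cos(3x) + 2x.
grad = lambda x: 3 * np.cos(3 * x) + 2 * x
x_star = projected_gradient_descent(grad, x0=np.full(4, 0.9), lo=-1.0, hi=1.0)
```

Because the feasible set factorizes, the projection costs O(d) per step; only finding a good step size and escaping poor stationary points of the non-convex objective require care.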

Abstract

We study the policy testing problem in discounted Markov decision processes (MDPs) in the fixed-confidence setting under a generative model with static sampling. The goal is to decide whether the value of a given policy exceeds a specified threshold while minimizing the number of samples. We first derive an instance-dependent lower bound that any reasonable algorithm must satisfy, characterized as the solution to an optimization problem with non-convex constraints. Guided by this formulation, we propose a new algorithm. While this design paradigm is common in pure exploration problems such as best-arm identification, the non-convex constraints that arise in MDPs introduce substantial difficulties. To address them, we reformulate the lower-bound problem by swapping the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. This reformulation admits an interpretation as a policy optimization task in a newly constructed reversed MDP. We further show that the global KL constraint can be decomposed exactly into a family of product-box subproblems, which are solved by projected policy gradient and combined through an outer budget search. Beyond policy testing, our reformulation and reversed MDP view suggest extensions to other pure exploration tasks in MDPs, including policy evaluation and best policy identification.