Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs

arXiv cs.LG / 5/5/2026


Key Points

  • The paper studies reinforcement learning in low-rank MDPs with function approximation, aiming to clarify which common RL “oracles” are computationally feasible versus intractable.
  • It establishes a hierarchy of these oracles, showing that policy evaluation is the most computationally efficient one, provided supervised learning can be solved efficiently.
  • Building on this, the authors propose an optimistic actor-critic method that uses only the policy evaluation oracle, avoiding computationally expensive planning/optimization oracles.
  • The method achieves improved sample-complexity guarantees for low-rank MDPs and extends the theory to approximately low-rank MDPs, arguing this model covers many real-world settings.
  • Experiments on several standard OpenAI Gym environments are used to validate the theoretical results.
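For readers unfamiliar with the setting, the summary above does not define the model. The standard low-rank MDP assumption is that the transition kernel admits a rank-$d$ factorization through two unknown feature maps:

```latex
% Low-rank MDP: the transition kernel factors through unknown
% d-dimensional feature maps \phi and \mu.
P(s' \mid s, a) \;=\; \langle \phi(s,a),\, \mu(s') \rangle,
\qquad
\phi : \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d,
\quad
\mu : \mathcal{S} \to \mathbb{R}^d .
```

Unlike in linear MDPs, neither $\phi$ nor $\mu$ is known to the learner, so representation learning is part of the problem; this is what makes the choice of computational oracle consequential.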

Abstract

Reinforcement learning (RL) is a fundamental framework for sequential decision-making, in which an agent learns an optimal policy through interactions with an unknown environment. In settings with function approximation, many existing RL algorithms achieve favorable sample complexity, but often rely on computationally intractable oracles. In this paper, we use supervised learning as a computational proxy to establish a clear hierarchy of commonly adopted RL oracles under low-rank Markov Decision Processes (MDPs). This hierarchy shows that policy evaluation is the most computationally efficient oracle, provided that supervised learning can be efficiently solved. Motivated by this observation, we propose a novel optimistic actor-critic algorithm that relies solely on the policy evaluation oracle. We prove that our algorithm outperforms the existing sample complexity guarantees for low-rank MDPs while avoiding computationally expensive planning or optimization oracles commonly assumed in prior works. We further extend our theoretical results to approximately low-rank MDPs and demonstrate that this setting captures a broad class of real-world environments. Finally, we validate our theoretical results with experiments on several standard Gym environments.
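To make the "policy evaluation as the only oracle" idea concrete, here is a minimal tabular sketch of an actor-critic loop: the critic does exact policy evaluation, and the actor performs a softmax (natural-policy-gradient-style) update from the critic's Q-values. This illustrates the general actor-critic pattern only, not the paper's algorithm (which adds optimism and works with function approximation in low-rank MDPs); the MDP numbers below are made up for illustration.

```python
import numpy as np

def evaluate(policy, P, R, gamma):
    """Critic: exact policy evaluation, V = (I - gamma * P_pi)^{-1} R_pi.

    P has shape (n_actions, n_states, n_states); R has shape (n_states, n_actions).
    """
    n = R.shape[0]
    P_pi = np.einsum('sa,ast->st', policy, P)      # state-to-state kernel under pi
    R_pi = (policy * R).sum(axis=1)                # expected reward under pi
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
    Q = R + gamma * np.einsum('ast,t->sa', P, V)   # one-step lookahead Q-values
    return V, Q

def actor_critic(P, R, gamma=0.9, eta=1.0, iters=200):
    """Actor: softmax policy updated with Q-values from the evaluation oracle."""
    n_s, n_a = R.shape
    logits = np.zeros((n_s, n_a))
    for _ in range(iters):
        policy = np.exp(logits - logits.max(axis=1, keepdims=True))
        policy /= policy.sum(axis=1, keepdims=True)
        V, Q = evaluate(policy, P, R, gamma)       # the ONLY oracle call
        logits += eta * Q                          # soft policy improvement
    return policy, V

# Toy 2-state, 2-action MDP (hypothetical numbers, illustration only)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # transitions under action 0
              [[0.5, 0.5], [0.7, 0.3]]])   # transitions under action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
policy, V = actor_critic(P, R)
```

In the tabular case this softmax update is known to converge to the optimal policy, which is why no separate planning or optimization oracle is needed; the paper's contribution is making this style of argument go through with optimism and low-rank function approximation.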