Model-Based Reinforcement Learning with Double Oracle Efficiency in Policy Optimization and Offline Estimation

arXiv cs.LG / 5/4/2026


Key Points

  • The paper tackles a key computational bottleneck in reinforcement learning by reducing how often RL algorithms must call expensive statistical estimation and planning “oracles.”
  • For tabular episodic MDPs, it proposes an algorithm that achieves the optimal \tilde{O}(\sqrt{T}) regret bound while making only O(H log log T) oracle calls when T is known in advance and O(H log T) calls when T is unknown (see the schedule sketch after this list).
  • A major contribution is that the required oracle complexity does not depend on the size of the state or action spaces, substantially lowering planning complexity versus prior offline oracle-efficient methods.
  • The authors extend the framework to linear MDPs with infinite state spaces and arbitrary action spaces, and prove sub-linear regret in that setting, broadening computational tractability to large and continuous environments.
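
The O(H log log T) versus O(H log T) oracle-call counts reflect how rarely the algorithm re-invokes its planning oracle. As an illustration only (the summary does not spell out the paper's actual epoch design or trigger conditions), the sketch below shows two generic re-planning schedules that produce call counts of these orders: doubling, which needs no knowledge of T, and a squaring grid capped at a known T.

```python
# Illustrative only: generic re-planning schedules that yield the O(log T) and
# O(log log T) oracle-call counts cited above. The paper's actual epoch design
# is not given in this summary; the extra H factor in its bounds would come
# from invoking the oracles once per stage h = 1, ..., H, not shown here.

def replan_steps_doubling(T: int) -> list[int]:
    """Unknown T: re-plan at steps 1, 2, 4, 8, ... -> O(log T) oracle calls."""
    steps, t = [], 1
    while t <= T:
        steps.append(t)
        t *= 2
    return steps

def replan_steps_squaring(T: int) -> list[int]:
    """Known T: squaring grid t -> t^2, capped at T -> O(log log T) oracle calls."""
    steps, t = [2], 2
    while t < T:
        t = min(t * t, T)
        steps.append(t)
    return steps

if __name__ == "__main__":
    T = 1_000_000
    print(len(replan_steps_doubling(T)))  # 20 re-plans: about log2(T)
    print(len(replan_steps_squaring(T)))  # 6 re-plans: about log2(log2(T))
```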

Abstract

Reinforcement learning (RL) in large environments often suffers from severe computational bottlenecks, as conventional regret minimization algorithms require repeated, costly calls to planning and statistical estimation oracles. While recent advances have explored offline oracle-efficient algorithms, their computational complexity typically scales with the cardinality of the state and action spaces, rendering them intractable for large-scale or continuous environments. In this paper, we address this fundamental limitation by studying offline oracle-efficient episodic RL through the lens of log-barrier and log-determinant regularization. Specifically, for tabular Markov Decision Processes (MDPs), we propose a novel algorithm that achieves the optimal \tilde{O}(\sqrt{T}) regret bound while requiring only O(H\log\log T) calls to both the offline statistical estimation and planning oracles when T is known and O(H\log T) calls when T is unknown. Crucially, this oracle complexity is entirely independent of the size of the state and action spaces. This strict independence drastically reduces the planning oracle complexity, representing a substantial improvement over existing offline oracle-efficient algorithms (Qian et al., 2024). Furthermore, we demonstrate the versatility of our framework by generalizing the algorithm to linear MDPs featuring infinite state spaces and arbitrary action spaces. We prove that this generalized approach successfully attains meaningful sub-linear regret. Consequently, our work yields the first doubly oracle-efficient (i.e., efficient with respect to both statistical estimation and policy optimization) regret minimization algorithm capable of solving MDPs with infinite state and action spaces, significantly expanding the boundaries of computationally tractable RL.
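
For context, the "log-barrier and log-determinant regularization" named in the abstract refers to two regularizers that are standard in the online-learning literature; their generic forms are shown below (the paper's exact objectives may differ and are not reproduced in this summary). The log-barrier keeps action probabilities bounded away from zero, and the log-determinant is its natural analogue over positive-definite feature covariance matrices in the linear-MDP setting.

```latex
% Generic forms of the two regularizers named in the abstract; standard in the
% online-learning literature. The paper's precise objectives may differ.
\[
  R_{\mathrm{lb}}(\pi) \;=\; \frac{1}{\eta} \sum_{a \in \mathcal{A}} \log \frac{1}{\pi(a)}
  \quad \text{(log-barrier over an action distribution } \pi\text{)}
\]
\[
  R_{\mathrm{ld}}(\Lambda) \;=\; -\log \det(\Lambda), \qquad \Lambda \succ 0
  \quad \text{(log-determinant over a feature covariance matrix)}
\]
```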