Model-Based Reinforcement Learning with Double Oracle Efficiency in Policy Optimization and Offline Estimation

arXiv cs.LG / 5/4/2026


Key Points

  • The paper tackles a key computational bottleneck in reinforcement learning by reducing how often RL algorithms must call expensive statistical estimation and planning “oracles.”
  • For tabular episodic MDPs, it proposes an algorithm that achieves the optimal \tilde{O}(\sqrt{T}) regret bound while making only O(H log log T) oracle calls when T is known in advance and O(H log T) calls when T is unknown (see the schedule sketch after this list).
  • A major contribution is that the required oracle complexity does not depend on the size of the state or action spaces, substantially lowering planning complexity versus prior offline oracle-efficient methods.
  • The authors extend the framework to linear MDPs with infinite state spaces and arbitrary action spaces, and prove sub-linear regret in that setting, broadening computational tractability to large and continuous environments.
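
The O(H log log T) versus O(H log T) oracle-call counts reflect how rarely the algorithm re-invokes its planning oracle. As an illustration only (the summary does not spell out the paper's actual epoch design or trigger conditions), the sketch below shows two generic re-planning schedules that produce call counts of these orders: doubling, which needs no knowledge of T, and a squaring grid capped at a known T.

```python
# Illustrative only: generic re-planning schedules that yield the O(log T) and
# O(log log T) oracle-call counts cited above. The paper's actual epoch design
# is not given in this summary; the extra H factor in its bounds would come
# from invoking the oracles once per stage h = 1, ..., H, not shown here.

def replan_steps_doubling(T: int) -> list[int]:
    """Unknown T: re-plan at steps 1, 2, 4, 8, ... -> O(log T) oracle calls."""
    steps, t = [], 1
    while t <= T:
        steps.append(t)
        t *= 2
    return steps

def replan_steps_squaring(T: int) -> list[int]:
    """Known T: squaring grid t -> t^2, capped at T -> O(log log T) oracle calls."""
    steps, t = [2], 2
    while t < T:
        t = min(t * t, T)
        steps.append(t)
    return steps

if __name__ == "__main__":
    T = 1_000_000
    print(len(replan_steps_doubling(T)))  # 20 re-plans: about log2(T)
    print(len(replan_steps_squaring(T)))  # 6 re-plans: about log2(log2(T))
```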

Abstract

Reinforcement learning (RL) in large environments often suffers from severe computational bottlenecks, as conventional regret minimization algorithms require repeated, costly calls to planning and statistical estimation oracles. While recent advances have explored offline oracle-efficient algorithms, their computational complexity typically scales with the cardinality of the state and action spaces, rendering them intractable for large-scale or continuous environments. In this paper, we address this fundamental limitation by studying offline oracle-efficient episodic RL through the lens of log-barrier and log-determinant regularization. Specifically, for tabular Markov Decision Processes (MDPs), we propose a novel algorithm that achieves the optimal \tilde{O}(\sqrt{T}) regret bound while requiring only O(H\log\log T) calls to both the offline statistical estimation and planning oracles when T is known and O(H\log T) calls when T is unknown. Crucially, this oracle complexity is entirely independent of the size of the state and action spaces. This strict independence drastically reduces the planning oracle complexity, representing a substantial improvement over existing offline oracle-efficient algorithms (Qian et al., 2024). Furthermore, we demonstrate the versatility of our framework by generalizing the algorithm to linear MDPs featuring infinite state spaces and arbitrary action spaces. We prove that this generalized approach successfully attains meaningful sub-linear regret. Consequently, our work yields the first doubly oracle-efficient (i.e., efficient with respect to both statistical estimation and policy optimization) regret minimization algorithm capable of solving MDPs with infinite state and action spaces, significantly expanding the boundaries of computationally tractable RL.
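
For context, the "log-barrier and log-determinant regularization" named in the abstract refers to two regularizers that are standard in the online-learning literature; their generic forms are shown below (the paper's exact objectives may differ and are not reproduced in this summary). The log-barrier keeps action probabilities bounded away from zero, and the log-determinant is its natural analogue over positive-definite feature covariance matrices in the linear-MDP setting.

```latex
% Generic forms of the two regularizers named in the abstract; standard in the
% online-learning literature. The paper's precise objectives may differ.
\[
  R_{\mathrm{lb}}(\pi) \;=\; \frac{1}{\eta} \sum_{a \in \mathcal{A}} \log \frac{1}{\pi(a)}
  \quad \text{(log-barrier over an action distribution } \pi\text{)}
\]
\[
  R_{\mathrm{ld}}(\Lambda) \;=\; -\log \det(\Lambda), \qquad \Lambda \succ 0
  \quad \text{(log-determinant over a feature covariance matrix)}
\]
```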