Optimistic Actor-Critic with Parametric Policies for Linear Markov Decision Processes

arXiv stat.ML / 3/31/2026


Key Points

  • The paper argues that prior theoretical analyses of actor-critic methods either rely on unrealistic exploration assumptions or use impractical algorithm modifications, especially in linear MDP settings.
  • It proposes an optimistic actor-critic framework for finite-horizon linear Markov decision processes that uses explicitly parameterized log-linear (parametric) policies to improve sampling efficiency.
  • The actor is trained via a tractable logit-matching regression objective, avoiding the computational cost of implicit policies often used with natural policy gradient.
  • For the critic, the method uses approximate Thompson sampling implemented with Langevin Monte Carlo to produce optimistic value estimates.
  • The authors prove sample-complexity bounds of \(\widetilde{O}(\epsilon^{-4})\) (on-policy) and \(\widetilde{O}(\epsilon^{-2})\) (off-policy), claiming state-of-the-art rates while being more practical than prior approaches.
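To make the log-linear policy and the logit-matching actor update concrete, here is a minimal NumPy sketch. The feature dimensions, ridge regularizer, and synthetic targets are illustrative assumptions, not the paper's exact objective; the idea is that the actor's logits \(\phi(s,a)^\top \theta\) are fit by a simple regression to target logits (e.g. scaled critic estimates), which is tractable, unlike sampling from an implicit NPG policy.

```python
import numpy as np

rng = np.random.default_rng(0)

d, num_actions = 8, 4          # feature dimension and action count (assumed)

def log_linear_policy(phi_sa, theta):
    """Softmax (log-linear) policy over per-action features phi_sa: (A, d)."""
    logits = phi_sa @ theta
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Logit-matching as ridge regression: fit theta so the policy logits
# phi(s,a)^T theta track target logits. Targets here are synthetic
# stand-ins for quantities the critic would supply.
Phi = rng.normal(size=(256, d))            # stacked features phi(s, a)
target_logits = rng.normal(size=256)       # hypothetical regression targets
lam = 1e-3                                 # ridge regularizer (assumed)
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ target_logits)

phi_sa = rng.normal(size=(num_actions, d))
probs = log_linear_policy(phi_sa, theta)   # valid distribution over actions
```

Because the policy is explicitly parameterized by `theta`, sampling an action is a single softmax draw, which is the efficiency gain over implicit NPG policies highlighted in the key points.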

Abstract

Although actor-critic methods have been successful in practice, their theoretical analyses have several limitations. Specifically, existing theoretical work either sidesteps the exploration problem by making strong assumptions or analyzes impractical methods with complicated algorithmic modifications. Moreover, the actor-critic methods analyzed for linear MDPs often employ natural policy gradient (NPG) and construct "implicit" policies without explicit parameterization. Such policies are computationally expensive to sample from, making environment interaction inefficient. To that end, we focus on finite-horizon linear MDPs and propose an optimistic actor-critic framework that uses parametric log-linear policies. In particular, we introduce a tractable *logit-matching* regression objective for the actor. For the critic, we use approximate Thompson sampling via Langevin Monte Carlo to obtain optimistic value estimates. We prove that the resulting algorithm achieves \(\widetilde{\mathcal{O}}(\epsilon^{-4})\) and \(\widetilde{\mathcal{O}}(\epsilon^{-2})\) sample complexity in the on-policy and off-policy settings, respectively. Our results match the state-of-the-art sample complexity of prior theoretical work, while our algorithm is more aligned with practice.
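The critic side can be sketched in the same spirit. Below is a generic Langevin Monte Carlo loop for approximate Thompson sampling on a linear critic with a ridge-regression loss: noisy gradient descent whose final iterate is approximately a draw from the (Gaussian) posterior over the weights. The step size, inverse temperature, and iteration count are illustrative defaults, not the paper's tuned values.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 8                                       # critic feature dimension (assumed)
Phi = rng.normal(size=(128, d))             # synthetic features phi(s, a)
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=128)  # regression targets

def lmc_sample(Phi, y, lam=1.0, eta=1e-3, beta=1.0, n_steps=500):
    """Approximate Thompson sampling via Langevin Monte Carlo: iterate
    noisy gradient steps on the ridge loss 0.5*||Phi w - y||^2 + 0.5*lam*||w||^2
    so the final iterate is roughly a posterior sample of the critic weights."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        grad = Phi.T @ (Phi @ w - y) + lam * w
        w = w - eta * grad + np.sqrt(2.0 * eta / beta) * rng.normal(size=w.shape)
    return w

w_sample = lmc_sample(Phi, y)   # one randomized critic draw per episode/update
```

Randomizing the critic weights this way is what yields optimistic value estimates with some probability, standing in for explicit bonus-based exploration.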