Optimistic Actor-Critic with Parametric Policies for Linear Markov Decision Processes

arXiv stat.ML / 3/31/2026


Key Points

  • The paper argues that prior theoretical analyses of actor-critic methods either rely on unrealistic exploration assumptions or use impractical algorithm modifications, especially in linear MDP settings.
  • It proposes an optimistic actor-critic framework for finite-horizon linear Markov decision processes that uses explicitly parameterized log-linear (parametric) policies to improve sampling efficiency.
  • The actor is trained via a tractable logit-matching regression objective, avoiding the computational cost of implicit policies often used with natural policy gradient.
  • For the critic, the method uses approximate Thompson sampling implemented with Langevin Monte Carlo to produce optimistic value estimates.
  • The authors prove sample-complexity bounds of \(\widetilde{O}(\epsilon^{-4})\) (on-policy) and \(\widetilde{O}(\epsilon^{-2})\) (off-policy), claiming state-of-the-art rates while being more practical than prior approaches.
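To make the log-linear policy and the logit-matching actor update concrete, here is a minimal NumPy sketch. The feature dimensions, ridge regularizer, and synthetic targets are illustrative assumptions, not the paper's exact objective; the idea is that the actor's logits \(\phi(s,a)^\top \theta\) are fit by a simple regression to target logits (e.g. scaled critic estimates), which is tractable, unlike sampling from an implicit NPG policy.

```python
import numpy as np

rng = np.random.default_rng(0)

d, num_actions = 8, 4          # feature dimension and action count (assumed)

def log_linear_policy(phi_sa, theta):
    """Softmax (log-linear) policy over per-action features phi_sa: (A, d)."""
    logits = phi_sa @ theta
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Logit-matching as ridge regression: fit theta so the policy logits
# phi(s,a)^T theta track target logits. Targets here are synthetic
# stand-ins for quantities the critic would supply.
Phi = rng.normal(size=(256, d))            # stacked features phi(s, a)
target_logits = rng.normal(size=256)       # hypothetical regression targets
lam = 1e-3                                 # ridge regularizer (assumed)
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ target_logits)

phi_sa = rng.normal(size=(num_actions, d))
probs = log_linear_policy(phi_sa, theta)   # valid distribution over actions
```

Because the policy is explicitly parameterized by `theta`, sampling an action is a single softmax draw, which is the efficiency gain over implicit NPG policies highlighted in the key points.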

Abstract

Although actor-critic methods have been successful in practice, their theoretical analyses have several limitations. Specifically, existing theoretical work either sidesteps the exploration problem by making strong assumptions or analyzes impractical methods with complicated algorithmic modifications. Moreover, the actor-critic methods analyzed for linear MDPs often employ natural policy gradient (NPG) and construct "implicit" policies without explicit parameterization. Such policies are computationally expensive to sample from, making environment interaction inefficient. To that end, we focus on finite-horizon linear MDPs and propose an optimistic actor-critic framework that uses parametric log-linear policies. In particular, we introduce a tractable *logit-matching* regression objective for the actor. For the critic, we use approximate Thompson sampling via Langevin Monte Carlo to obtain optimistic value estimates. We prove that the resulting algorithm achieves \(\widetilde{\mathcal{O}}(\epsilon^{-4})\) and \(\widetilde{\mathcal{O}}(\epsilon^{-2})\) sample complexity in the on-policy and off-policy settings, respectively. Our results match the state-of-the-art sample complexity of prior theoretical work, while our algorithm is more aligned with practice.
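The critic side can be sketched in the same spirit. Below is a generic Langevin Monte Carlo loop for approximate Thompson sampling on a linear critic with a ridge-regression loss: noisy gradient descent whose final iterate is approximately a draw from the (Gaussian) posterior over the weights. The step size, inverse temperature, and iteration count are illustrative defaults, not the paper's tuned values.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 8                                       # critic feature dimension (assumed)
Phi = rng.normal(size=(128, d))             # synthetic features phi(s, a)
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=128)  # regression targets

def lmc_sample(Phi, y, lam=1.0, eta=1e-3, beta=1.0, n_steps=500):
    """Approximate Thompson sampling via Langevin Monte Carlo: iterate
    noisy gradient steps on the ridge loss 0.5*||Phi w - y||^2 + 0.5*lam*||w||^2
    so the final iterate is roughly a posterior sample of the critic weights."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        grad = Phi.T @ (Phi @ w - y) + lam * w
        w = w - eta * grad + np.sqrt(2.0 * eta / beta) * rng.normal(size=w.shape)
    return w

w_sample = lmc_sample(Phi, y)   # one randomized critic draw per episode/update
```

Randomizing the critic weights this way is what yields optimistic value estimates with some probability, standing in for explicit bonus-based exploration.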