Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

arXiv cs.LG / 3/26/2026


Key Points

  • The paper develops an online reinforcement learning algorithm for infinite-horizon MDPs that targets two objectives, average-reward regret and discounted $\gamma$-regret, using a single tractable UCB-style approach.
  • It proves the first optimal variance-dependent regret bounds, of the form $\tilde{O}(\sqrt{SA\,\mathrm{Var}} + \text{lower-order terms})$, where $\mathrm{Var}$ captures cumulative transition variance; the method therefore adapts to easier instances, achieving nearly constant regret in deterministic MDPs.
  • For the average-reward setting, the lower-order terms improve significantly when the learner has prior knowledge of the optimal bias span $\Vert h^\star\Vert_{\mathrm{sp}}$: they scale as $\Vert h^\star\Vert_{\mathrm{sp}} S^2 A$, which is shown to be optimal in both $\Vert h^\star\Vert_{\mathrm{sp}}$ and the number of actions $A$.
  • Without prior knowledge, the authors prove an impossibility result: no algorithm can have lower-order terms smaller than $\Vert h^\star\Vert_{\mathrm{sp}}^2 S A$. They also give a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_{\mathrm{sp}}^2 S^3 A$, nearly matching this lower bound and characterizing the achievable gap.
  • Overall, the work gives a near-complete characterization of the optimal dependence on the bias span $\Vert h^\star\Vert_{\mathrm{sp}}$ in both leading and lower-order regret terms, and highlights a fundamental difference between informed and prior-free learning.
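Restating the key points in display form (with $T$ the interaction horizon and all other symbols as in the abstract), the guarantees sketched above read:

```latex
% Leading term (both settings): variance-dependent, minimax-optimal in the worst case
\mathrm{Reg}(T) \;=\; \tilde{O}\!\Big(\sqrt{S A \,\mathrm{Var}} \;+\; \text{lower-order terms}\Big)

% Average-reward lower-order terms:
\text{informed (span known):}\quad \Vert h^\star \Vert_{\mathrm{sp}}\, S^2 A
\qquad
\text{prior-free:}\quad \Vert h^\star \Vert_{\mathrm{sp}}^2\, S^3 A
\quad \text{vs. lower bound}\quad \Vert h^\star \Vert_{\mathrm{sp}}^2\, S A
```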

Abstract

Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high "burn-in" costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}(\sqrt{SA\,\mathrm{Var}} + \text{lower-order terms})$, where $S, A$ are the state and action space sizes, and $\mathrm{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $\gamma$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\Vert h^\star\Vert_{\mathrm{sp}}$, our algorithm obtains lower-order terms scaling as $\Vert h^\star\Vert_{\mathrm{sp}} S^2 A$, which we prove is optimal in both $\Vert h^\star\Vert_{\mathrm{sp}}$ and $A$. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\Vert h^\star\Vert_{\mathrm{sp}}^2 S A$, and we provide a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_{\mathrm{sp}}^2 S^3 A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\Vert h^\star\Vert_{\mathrm{sp}}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.
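The abstract does not spell out the algorithm, but variance adaptation in UCB-style methods typically comes from Bernstein-style (empirical-variance) exploration bonuses in place of worst-case Hoeffding bonuses: when the observed next-state values have low variance, as in a deterministic MDP, the bonus collapses to a lower-order $1/n$ term. The sketch below illustrates that generic mechanism only; `bernstein_bonus` and its constants are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def bernstein_bonus(v_next, n, delta=0.05, v_max=1.0):
    """Variance-aware (empirical Bernstein) exploration bonus for one (s, a) pair.

    v_next : array of next-state value samples observed from (s, a)
    n      : number of visits to (s, a)
    The bonus shrinks with the empirical variance of the next-state values,
    so deterministic transitions (zero variance) incur only the
    lower-order 1/n term rather than the Hoeffding-scale sqrt(1/n) term.
    Constants are illustrative, not tuned to any particular analysis.
    """
    n = max(int(n), 1)
    var = float(np.var(v_next))       # empirical variance of next-state values
    log_term = np.log(3.0 / delta)    # confidence-level term
    # Empirical Bernstein bound: variance-dependent leading term + 1/n correction
    return np.sqrt(2.0 * var * log_term / n) + 3.0 * v_max * log_term / n

# A deterministic transition (constant next-state value) gets a much smaller
# bonus than a stochastic one with the same visit count:
det_bonus = bernstein_bonus(np.ones(100), n=100)
stoch_bonus = bernstein_bonus(np.random.RandomState(0).uniform(size=100), n=100)
```

With 100 visits, the deterministic bonus is purely the lower-order term, while the stochastic one additionally pays for its variance; this is the mechanism by which variance-dependent bounds yield nearly constant regret on deterministic instances.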