Sharp asymptotic theory for Q-learning with LDTZ learning rate and its generalization

arXiv stat.ML / 4/7/2026

Key Points

The paper develops sharp non-asymptotic error bounds for Q-learning under a general power-law decay to zero learning-rate schedule (PD2Z-ν), extending prior work on LD2Z (linear decay to zero).
It derives a central limit theory for a new “tail” Polyak-Ruppert averaging estimator, enabling more refined statistical conclusions about Q-learning performance.
The authors also prove a time-uniform Gaussian approximation (strong invariance principle) for partial-sum processes of Q-learning iterates, supporting bootstrap-based inference.
Theoretical and numerical results together show that LD2Z, and more broadly PD2Z-ν, can deliver a “best-of-both-worlds” property: fast initial decay like constant step sizes plus asymptotic convergence guarantees like polynomial decay schedules.
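
To make the three schedule families concrete, here is a minimal Python sketch of the formulas stated in the abstract: constant (η_t = η), polynomial decay (η_t = η t^(-α)), and PD2Z-ν (η_{t,n} = η(1 - t/n)^ν, with ν = 1 giving LD2Z). The function names and the default values of η, α, and n are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def constant_schedule(t, eta=0.5):
    """eta_t = eta for every step t."""
    return np.full_like(np.asarray(t, dtype=float), eta)

def polynomial_schedule(t, eta=0.5, alpha=0.75):
    """eta_t = eta * t^(-alpha), defined for t >= 1."""
    return eta * np.asarray(t, dtype=float) ** (-alpha)

def pd2z_schedule(t, n, eta=0.5, nu=1.0):
    """eta_{t,n} = eta * (1 - t/n)^nu; nu = 1 is the LD2Z schedule."""
    return eta * (1.0 - np.asarray(t, dtype=float) / n) ** nu

# Illustrative horizon (an assumption, not from the paper).
n = 1000
steps = np.arange(1, n + 1)

const = constant_schedule(steps)
poly = polynomial_schedule(steps)
ld2z = pd2z_schedule(steps, n)            # nu = 1
pd2z = pd2z_schedule(steps, n, nu=2.0)    # a faster power-law decay

for name, sched in [("constant", const), ("polynomial", poly),
                    ("LD2Z", ld2z), ("PD2Z-2", pd2z)]:
    print(f"{name:>10}: step 100 = {sched[99]:.4f}, final = {sched[-1]:.4f}")
```

In this sketch the LD2Z step size stays close to η early on, as a constant schedule does, and reaches exactly zero at t = n. The polynomial schedule shrinks quickly at the start and never reaches zero in finite time.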

Abstract

Despite the sustained popularity of Q-learning as a practical tool for policy determination, a majority of relevant theoretical literature deals with either constant ($\eta_{t}\equiv \eta$) or polynomially decaying ($\eta_{t} = \eta t^{-\alpha}$) learning schedules. However, it is well known that these choices suffer from either persistent bias or prohibitively slow convergence. In contrast, the recently proposed linear decay to zero (LD2Z: $\eta_{t,n}=\eta(1-t/n)$) schedule has shown appreciable empirical performance, but its theoretical and statistical properties remain largely unexplored, especially in the Q-learning setting. We address this gap in the literature by first considering a general class of power-law decay to zero (PD2Z-$\nu$: $\eta_{t,n}=\eta(1-t/n)^{\nu}$). Proceeding step-by-step, we present a sharp non-asymptotic error bound for Q-learning with PD2Z-$\nu$ schedule, which then is used to derive a central limit theory for a new tail Polyak-Ruppert averaging estimator. Finally, we also provide a novel time-uniform Gaussian approximation (also known as strong invariance principle) for the partial sum process of Q-learning iterates, which facilitates bootstrap-based inference. All our theoretical results are complemented by extensive numerical experiments. Beyond being new theoretical and statistical contributions to the Q-learning literature, our results definitively establish that LD2Z and in general PD2Z-$\nu$ achieve a best-of-both-worlds property: they inherit the rapid decay from initialization (characteristic of constant step-sizes) while retaining the asymptotic convergence guarantees (characteristic of polynomially decaying schedules). This dual advantage explains the empirical success of LD2Z while providing practical guidelines for inference through our results.
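
As a rough illustration of the setting the abstract describes, the Python sketch below runs Q-learning with the PD2Z-ν step size η(1 - t/n)^ν on a small random MDP and also returns an average of the late iterates. Several pieces are assumptions made for illustration and are not taken from the paper: the toy MDP, the synchronous (generative-model) sampling of every state-action pair at each step, the parameter values, and the definition of the tail average as a plain mean over the last half of the iterates. The paper's tail Polyak-Ruppert estimator may be defined differently.

```python
import numpy as np

def q_star(P, R, gamma, iters=500):
    """Reference Q* via value iteration on the known toy model."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

def q_learning_pd2z(P, R, gamma, n, eta=0.5, nu=1.0, tail_frac=0.5, seed=0):
    """Synchronous Q-learning with step size eta * (1 - t/n)^nu.

    Returns the last iterate and the mean of the last `tail_frac`
    fraction of iterates (an assumed form of tail averaging).
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    cdf = P.cumsum(axis=-1)                    # per-(s, a) transition CDFs
    Q = np.zeros((S, A))
    tail_start = int((1.0 - tail_frac) * n)
    tail_sum, tail_count = np.zeros((S, A)), 0
    for t in range(1, n + 1):
        step = eta * (1.0 - t / n) ** nu       # PD2Z-nu; nu = 1 is LD2Z
        # Inverse-CDF sampling of one next state for every (s, a) pair.
        u = rng.random((S, A, 1))
        next_s = np.minimum((u > cdf).sum(axis=-1), S - 1)
        target = R + gamma * Q.max(axis=1)[next_s]
        Q = (1.0 - step) * Q + step * target
        if t > tail_start:
            tail_sum += Q
            tail_count += 1
    return Q, tail_sum / tail_count

# Hypothetical toy MDP: 3 states, 2 actions, deterministic rewards.
rng = np.random.default_rng(1)
S, A, gamma = 3, 2, 0.5
P = rng.dirichlet(np.ones(S), size=(S, A))     # shape (S, A, S)
R = rng.random((S, A))

Q_ref = q_star(P, R, gamma)
Q_last, Q_tail = q_learning_pd2z(P, R, gamma, n=20000)
err_last = np.abs(Q_last - Q_ref).max()
err_tail = np.abs(Q_tail - Q_ref).max()
print(f"sup-norm error, last iterate: {err_last:.4f}")
print(f"sup-norm error, tail average: {err_tail:.4f}")
```

Changing `nu` or `tail_frac` gives a quick way to see how the decay exponent and the averaging window affect both errors on this toy problem. It says nothing about the paper's actual experiments.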

