Sharp asymptotic theory for Q-learning with LD2Z learning rate and its generalization

arXiv stat.ML / 4/7/2026


Key Points

  • The paper develops sharp non-asymptotic error bounds for Q-learning using a general power-law-to-zero learning-rate schedule (PD2Z-ν), extending prior work on LD2Z (linear decay to zero).
  • It derives a central limit theory for a new “tail” Polyak-Ruppert averaging estimator, enabling more refined statistical conclusions about Q-learning performance.
  • The authors also prove a time-uniform Gaussian approximation (strong invariance principle) for partial-sum processes of Q-learning iterates, supporting bootstrap-based inference.
  • Theoretical and numerical results together show that LD2Z—and more broadly PD2Z-ν—can deliver a “best-of-both-worlds” property: fast initial decay like constant step sizes plus asymptotic convergence guarantees like polynomial decay schedules.
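To make the schedules compared in these key points concrete, here is a minimal Python sketch of each one. The formulas follow the abstract; the function names and default constants (eta, alpha, nu, horizon n) are illustrative choices, not values from the paper.

```python
def constant(t, eta=0.5):
    """Constant schedule: eta_t = eta for all t."""
    return eta

def poly_decay(t, eta=0.5, alpha=0.75):
    """Polynomial decay: eta_t = eta * t^(-alpha), for t >= 1."""
    return eta * t ** (-alpha)

def ld2z(t, n, eta=0.5):
    """LD2Z: eta_{t,n} = eta * (1 - t/n), linear decay to zero at t = n."""
    return eta * (1.0 - t / n)

def pd2z(t, n, eta=0.5, nu=2.0):
    """PD2Z-nu: eta_{t,n} = eta * (1 - t/n)^nu; nu = 1 recovers LD2Z."""
    return eta * (1.0 - t / n) ** nu
```

Note that LD2Z and PD2Z-ν are horizon-dependent (they need the total iteration budget n up front), whereas the constant and polynomial schedules are not; this is the structural difference that drives their distinct bias/variance behavior.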

Abstract

Despite the sustained popularity of Q-learning as a practical tool for policy determination, the majority of the relevant theoretical literature deals with either constant ($\eta_{t}\equiv \eta$) or polynomially decaying ($\eta_{t} = \eta t^{-\alpha}$) learning-rate schedules. However, it is well known that these choices suffer from either persistent bias or prohibitively slow convergence. In contrast, the recently proposed linear-decay-to-zero schedule (\texttt{LD2Z}: $\eta_{t,n}=\eta(1-t/n)$) has shown appreciable empirical performance, but its theoretical and statistical properties remain largely unexplored, especially in the Q-learning setting. We address this gap in the literature by first considering a general class of power-law-decay-to-zero schedules (\texttt{PD2Z}-$\nu$: $\eta_{t,n}=\eta(1-t/n)^{\nu}$). Proceeding step by step, we present a sharp non-asymptotic error bound for Q-learning with the \texttt{PD2Z}-$\nu$ schedule, which is then used to derive a central limit theory for a new \textit{tail} Polyak-Ruppert averaging estimator. Finally, we also provide a novel time-uniform Gaussian approximation (also known as a \textit{strong invariance principle}) for the partial-sum process of Q-learning iterates, which facilitates bootstrap-based inference. All our theoretical results are complemented by extensive numerical experiments. Beyond being new theoretical and statistical contributions to the Q-learning literature, our results definitively establish that \texttt{LD2Z}, and more generally \texttt{PD2Z}-$\nu$, achieves a best-of-both-worlds property: it inherits the rapid decay from initialization characteristic of constant step sizes while retaining the asymptotic convergence guarantees characteristic of polynomially decaying schedules. This dual advantage explains the empirical success of \texttt{LD2Z} while providing practical guidelines for inference through our results.
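To illustrate the pipeline the abstract describes, here is a hedged sketch of synchronous tabular Q-learning on a toy 2-state, 2-action MDP, run with a PD2Z-ν step-size schedule and followed by a simple average over the final half of the iterates. The MDP, the tail fraction, and all constants are illustrative assumptions; the paper's tail Polyak-Ruppert estimator and sampling model may be defined differently.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 2, 2, 0.9
# Toy model (an assumption, not from the paper):
# transition probabilities P[s, a, s'] and rewards R[s, a]
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])

def pd2z(t, n, eta=0.5, nu=2.0):
    # PD2Z-nu schedule: eta_{t,n} = eta * (1 - t/n)^nu
    return eta * (1.0 - t / n) ** nu

n = 20000
Q = np.zeros((S, A))
tail_start = n // 2          # average only the last half of iterates
tail_sum = np.zeros((S, A))

for t in range(n):
    eta_t = pd2z(t, n)
    # Synchronous Q-learning: sample one next state per (s, a) pair
    for s in range(S):
        for a in range(A):
            s_next = rng.choice(S, p=P[s, a])
            target = R[s, a] + gamma * Q[s_next].max()
            Q[s, a] += eta_t * (target - Q[s, a])
    if t >= tail_start:
        tail_sum += Q

Q_tail = tail_sum / (n - tail_start)  # tail-averaged estimator

# Reference: fixed point of the Bellman optimality operator, computed
# by value iteration on the known model.
Q_star = np.zeros((S, A))
for _ in range(2000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)
```

Averaging only the tail of the trajectory discards the transient phase dominated by initialization bias, which is the intuition behind the central limit theory for the tail estimator described in the key points.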