Abstract
In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for $\mathbb{E}[\|Q_k - Q^*\|_\infty^2]$, implying a sample complexity of order $\mathcal{O}(1/\xi^2)$ for achieving $\mathbb{E}[\|Q_k - Q^*\|_\infty]\le \xi$. This matches the rate of off-policy Q-learning, but with worse dependence on exploration-related parameters. We also derive a finite-time rate for $\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]$, where $\pi_k$ is the learning policy at iteration $k$, highlighting the exploration-exploitation trade-off in on-policy Q-learning. While exploration is weaker than in off-policy methods, on-policy learning enjoys an exploitation advantage as the learning policy converges to an optimal one. Numerical results support our theory. Technically, rapidly time-varying learning policies induce time-inhomogeneous Markovian noise, creating significant analytical challenges under minimal exploration. To address this, we develop a Poisson-equation-based decomposition of the Markovian noise under a lazy transition matrix, separating it into a martingale-difference term and residual terms. The residuals are controlled via sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These techniques may extend to other RL algorithms with time-varying policies, such as single-timescale actor-critic methods and learning-in-games algorithms.
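To make the setting concrete, the following is a minimal sketch of tabular Q-learning with on-policy sampling: actions are drawn from a time-varying $\epsilon$-greedy learning policy $\pi_k$ derived from the current iterate $Q_k$, while the update itself uses the standard max-based Q-learning target. The two-state MDP ($P$, $R$), the step-size schedule, and the $\epsilon$ decay are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP (illustrative only, not from the paper):
# transition probabilities P[s, a, s'] and rewards R[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9  # discount factor

def epsilon_greedy(Q, s, eps):
    """Learning policy pi_k: greedy w.r.t. the current estimate Q_k,
    with epsilon-probability random exploration."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def on_policy_q_learning(num_steps=200_000):
    Q = np.zeros((2, 2))
    s = 0
    for k in range(1, num_steps + 1):
        # Time-varying learning policy: epsilon decays, so pi_k
        # converges toward the greedy (eventually optimal) policy.
        eps = max(0.05, 1.0 / np.sqrt(k))
        a = epsilon_greedy(Q, s, eps)          # on-policy sampling: a ~ pi_k(.|s)
        s_next = int(rng.choice(2, p=P[s, a])) # one Markovian transition
        alpha = 1.0 / (1.0 + 0.001 * k)        # diminishing step size
        # Max-based Q-learning update along the on-policy trajectory.
        Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q

Q_hat = on_policy_q_learning()
```

Because $\pi_k$ changes with $Q_k$ at every step, the trajectory noise here is exactly the time-inhomogeneous Markovian noise the abstract refers to; an off-policy variant would instead fix a single behavior policy throughout.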