A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

arXiv stat.ML / 4/7/2026


Key Points

  • The paper provides a finite-time convergence analysis for Q-learning when using time-varying (on-policy) learning policies in discounted MDPs under minimal assumptions, requiring only that some policy yields an irreducible Markov chain over states.
  • It establishes a last-iterate convergence rate for \(\mathbb{E}[\|Q_k - Q^*\|_\infty^2]\), yielding a sample complexity of order \(\mathcal{O}(1/\xi^2)\) to achieve \(\mathbb{E}[\|Q_k - Q^*\|_\infty] \le \xi\).
  • The authors show the resulting rate matches off-policy Q-learning in \(\xi\)-dependence but has worse dependence on exploration-related parameters, reflecting the trade-offs between on-policy and off-policy sampling.
  • A separate finite-time bound is derived for \(\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]\), making the exploration–exploitation dynamics explicit as \(\pi_k\) evolves toward an optimal policy.
  • To handle the analytical difficulties from time-inhomogeneous Markov noise induced by rapidly time-varying policies, the work develops a Poisson-equation-based decomposition into martingale-difference and residual components, with sensitivity bounds enabling control of the residual terms.
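As a concrete illustration of the setting these results cover, the sketch below runs Q-learning whose behaviour policy is epsilon-greedy with respect to the *current* Q-estimate, so the sampling policy changes at every step (time-varying, on-policy). The two-state MDP, step size, and epsilon are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

# Toy deterministic MDP (an assumption for illustration, not from the paper):
# states {0, 1}, actions {0, 1}; taking action a moves to state a and yields
# reward 1 if a == 1 else 0.  With gamma = 0.9 the optimal Q-values are
# Q*(s, 1) = 10 and Q*(s, 0) = 9 in both states.
GAMMA, EPS, ALPHA, STEPS = 0.9, 0.3, 0.1, 20_000
rng = np.random.default_rng(0)

Q = np.zeros((2, 2))
s = 0
for _ in range(STEPS):
    # Time-varying learning policy: epsilon-greedy w.r.t. the current
    # Q-estimate, so the behaviour policy evolves with the iterates.
    a = rng.integers(2) if rng.random() < EPS else int(Q[s].argmax())
    s_next, r = a, float(a == 1)
    # Standard Q-learning update along the single on-policy trajectory.
    Q[s, a] += ALPHA * (r + GAMMA * Q[s_next].max() - Q[s, a])
    s = s_next

Q_star = np.array([[9.0, 10.0], [9.0, 10.0]])
print(np.abs(Q - Q_star).max())  # sup-norm error ||Q_k - Q*||_inf
```

Because the greedy part of the policy is re-derived from the iterates, the trajectory's noise is time-inhomogeneous Markovian, which is exactly the analytical difficulty the paper addresses.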

Abstract

In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for \(\mathbb{E}[\|Q_k - Q^*\|_\infty^2]\), implying a sample complexity of order \(\mathcal{O}(1/\xi^2)\) for achieving \(\mathbb{E}[\|Q_k - Q^*\|_\infty] \le \xi\). This matches the rate of off-policy Q-learning, but with worse dependence on exploration-related parameters. We also derive a finite-time rate for \(\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]\), where \(\pi_k\) is the learning policy at iteration \(k\), highlighting the exploration–exploitation trade-off in on-policy Q-learning. While exploration is weaker than in off-policy methods, on-policy learning enjoys an exploitation advantage as the learning policy converges to an optimal one. Numerical results support our theory. Technically, rapidly time-varying learning policies induce time-inhomogeneous Markovian noise, creating significant analytical challenges under minimal exploration. To address this, we develop a Poisson-equation-based decomposition of the Markovian noise under a lazy transition matrix, separating it into a martingale-difference term and residual terms. The residuals are controlled via sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These techniques may extend to other RL algorithms with time-varying policies, such as single-timescale actor-critic methods and learning-in-games algorithms.
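The Poisson-equation decomposition can be sketched in its generic form. This is the standard construction for a fixed kernel \(P\) with stationary distribution \(\mu\); the paper's version additionally works under a lazy transition matrix and handles time-varying kernels, which this sketch omits. For a noise term \(f(X_k) - \mu(f)\) along a chain \(\{X_k\}\),

\[
f(X_k) - \mu(f) \;=\; \underbrace{V(X_{k+1}) - (PV)(X_k)}_{\text{martingale difference}} \;+\; \underbrace{V(X_k) - V(X_{k+1})}_{\text{telescoping residual}},
\]

where \(V\) solves the Poisson equation \(V - PV = f - \mu(f)\), so that \(\mathbb{E}[V(X_{k+1}) \mid X_k] = (PV)(X_k)\). When the kernel depends on \((Q_k, \pi_k)\), the solution \(V_k\) drifts across iterations, which is what produces the extra residual terms controlled by the sensitivity bounds.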