From Set Convergence to Pointwise Convergence: Finite-Time Guarantees for Average-Reward Q-Learning with Adaptive Stepsizes

arXiv stat.ML / 4/7/2026


Key Points

  • The paper provides the first finite-time convergence analysis for last-iterate average-reward Q-learning with an asynchronous implementation, establishing a mean-square convergence rate of Õ(1/k) to the optimal Q-function (in the span seminorm).
  • It shows that adaptive stepsizes are crucial: without them, the asynchronous Q-learning update fails to converge to the intended target.
  • By introducing a centering step, the authors further prove pointwise mean-square convergence to the centered optimal Q-function, again achieving an Õ(1/k) rate.
  • The work interprets adaptive stepsizes as a form of implicit importance sampling that counteracts the destabilizing effects of asynchronous updates; however, they also make each update depend on the entire sample history, turning the method into a strongly correlated, non-Markovian stochastic approximation.
  • To handle these correlations, the authors develop a time-inhomogeneous Markovian reformulation and use time-varying bounds and Markov chain concentration techniques, with tools they expect to be broadly useful for analyzing other adaptive-step-size SA algorithms.
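To make the "local clocks" idea concrete, here is a minimal tabular sketch of asynchronous average-reward (RVI-style) Q-learning on a hypothetical random MDP, where each state-action pair keeps its own visit counter and uses its reciprocal as the stepsize. The MDP, the uniform behavior policy, and the choice of reference value are all illustrative assumptions, not the paper's exact algorithm or conditions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP (hypothetical, for illustration): 3 states, 2 actions.
nS, nA = 3, 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # transition probs P[s, a, s']
R = rng.uniform(0.0, 1.0, size=(nS, nA))       # expected rewards

Q = np.zeros((nS, nA))
counts = np.zeros((nS, nA))  # local clocks: visits to each (s, a)

s = 0
for k in range(20_000):
    a = rng.integers(nA)                 # behavior policy: uniform exploration
    s_next = rng.choice(nS, p=P[s, a])   # sample a transition from one pair only
    r = R[s, a]

    counts[s, a] += 1
    alpha = 1.0 / counts[s, a]           # adaptive stepsize = local clock for (s, a)

    # RVI-style average-reward target: the subtracted reference value
    # (here max over a fixed state, one common choice) stands in for the
    # unknown optimal average reward.
    target = r + Q[s_next].max() - Q[0].max()
    Q[s, a] += alpha * (target - Q[s, a])  # asynchronous: only (s, a) is updated
    s = s_next
```

Because each pair is visited at random times, the local-clock stepsize gives every pair an effective 1/n schedule regardless of how the global iteration counter advances, which is the reweighting the "implicit importance sampling" interpretation refers to.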

Abstract

This work presents the first finite-time analysis for the last-iterate convergence of average-reward Q-learning with an asynchronous implementation. A key feature of the algorithm we study is the use of adaptive stepsizes, which serve as local clocks for each state-action pair. We show that, under appropriate assumptions, the iterates generated by this Q-learning algorithm converge at a rate of Õ(1/k) (in the mean-square sense) to the optimal Q-function in the span seminorm. Moreover, by adding a centering step to the algorithm, we further establish pointwise mean-square convergence to the centered optimal Q-function, also at a rate of Õ(1/k). We also show that adaptive stepsizes are necessary: without them, the algorithm fails to converge to the correct target. In addition, adaptive stepsizes can be interpreted as a form of implicit importance sampling that counteracts the effects of asynchronous updates. Technically, the use of adaptive stepsizes makes each Q-learning update depend on the entire sample history, introducing strong correlations and making the algorithm a non-Markovian stochastic approximation (SA) scheme. Our approach to overcoming this challenge involves (1) a time-inhomogeneous Markovian reformulation of non-Markovian SA, and (2) a combination of almost-sure time-varying bounds, conditioning arguments, and Markov chain concentration inequalities to break the strong correlations between the adaptive stepsizes and the iterates. The tools developed in this work are likely to be broadly applicable to the analysis of general SA algorithms with adaptive stepsizes.
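The distinction between span-seminorm convergence and pointwise convergence, and why a centering step bridges the two, can be illustrated in a few lines. The span seminorm is the standard one; the particular centering shown (subtracting the first entry) is just one common choice and may differ from the paper's:

```python
import numpy as np

def span(q):
    # Span seminorm: span(q) = max(q) - min(q); it is zero on constant vectors.
    return float(q.max() - q.min())

q_opt = np.array([2.0, 5.0, 3.0])   # a stand-in "optimal Q" (illustrative values)
q_est = q_opt + 7.0                 # an estimate that is off by a constant shift

# Convergence in the span seminorm cannot see constant offsets:
assert span(q_est - q_opt) == 0.0

# A centering step pins down the free constant, so centered iterates
# converging in span also converge pointwise.
center = lambda q: q - q[0]
assert np.allclose(center(q_est), center(q_opt))
```

This is why, in average-reward problems where the Bellman operator only determines the Q-function up to an additive constant, the uncentered iterates can at best converge in span, while the centered variant admits a pointwise guarantee.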