Online Covariance Estimation in Averaged SGD: Improved Batch-Mean Rates and Minimax Optimality via Trajectory Regression

arXiv cs.LG / 4/14/2026


Key Points

  • The paper studies online covariance matrix estimation for Polyak–Ruppert averaged SGD and analyzes the convergence of the existing batch-means estimator in operator norm.
  • By conducting a rigorous per-block bias analysis and re-tuning the block-growth parameter, the authors improve the batch-means rate from O(n^{-(1-α)/4}) to O(n^{-(1-α)/3}), i.e. from O(n^{-1/8}) to O(n^{-1/6}) at the optimal learning-rate exponent α → 1/2^+.
  • The modified estimator remains Hessian-free (no Hessian access) and preserves O(d^2) memory; the analysis provides a full error decomposition into variance, stationarity bias, and nonlinearity bias (see the sketch after this list).
  • The work also introduces a weighted-averaging variant that avoids hard truncation, and establishes the minimax rate Θ(n^{-(1-α)/2}) for Hessian-free estimation via a Le Cam lower bound together with a matching trajectory-regression estimator that recovers the Hessian from the drift of the SGD trajectory.
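
For intuition, here is a minimal sketch of averaged SGD with an online batch-means covariance estimator of the kind described above. The block-boundary rule a_k = ⌊c·k^β⌋, the step-size schedule, and the final normalisation are illustrative assumptions rather than the paper's exact tuning; only O(d^2) accumulators are kept, with the block-mean centring expanded at the end.

```python
import numpy as np

def averaged_sgd_batch_means(grad, x0, n, alpha=0.501, eta0=0.5, c=1.0, beta=2.0):
    """Sketch: Polyak-Ruppert averaged SGD with an online batch-means
    covariance estimator.

    Assumptions (not taken from the paper): block boundaries a_k = floor(c * k**beta),
    step size eta_t = eta0 * t**(-alpha), and normalisation by the total block length.
    `grad(x)` returns a single stochastic gradient at x.
    """
    d = x0.size
    x, xbar = x0.copy(), np.zeros(d)

    # O(d^2) running accumulators -- no per-block storage, so the centring term
    # can be expanded at the end:
    #   sum_k (S_k - l_k xbar)(S_k - l_k xbar)^T
    #     = A - xbar b^T - b xbar^T + L2 * xbar xbar^T
    A = np.zeros((d, d))   # sum_k S_k S_k^T over closed blocks
    b = np.zeros(d)        # sum_k l_k S_k
    L1 = 0.0               # sum_k l_k
    L2 = 0.0               # sum_k l_k^2

    S, length = np.zeros(d), 0          # current (open) block
    k = 1
    next_boundary = int(c * k ** beta)

    for t in range(1, n + 1):
        eta = eta0 * t ** (-alpha)      # polynomially decaying step size
        x = x - eta * grad(x)
        xbar += (x - xbar) / t          # Polyak-Ruppert average
        S += x
        length += 1
        if t >= next_boundary:          # close the current block, open a new one
            A += np.outer(S, S); b += length * S
            L1 += length; L2 += length ** 2
            S, length = np.zeros(d), 0
            k += 1
            next_boundary = int(c * k ** beta)

    if length > 0:                      # flush the last partial block
        A += np.outer(S, S); b += length * S
        L1 += length; L2 += length ** 2

    centred = A - np.outer(xbar, b) - np.outer(b, xbar) + L2 * np.outer(xbar, xbar)
    return xbar, centred / L1
```

Expanding the centring term is what keeps the estimator fully online with O(d^2) memory, matching the memory claim in the key points; only the growing-block schedule (the `beta` exponent here) has to be tuned to trade variance against the per-block bias.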

Abstract

We study online covariance matrix estimation for Polyak--Ruppert averaged stochastic gradient descent (SGD). The online batch-means estimator of Zhu, Chen and Wu (2023) achieves an operator-norm convergence rate of O(n^{-(1-\alpha)/4}), which yields O(n^{-1/8}) at the optimal learning-rate exponent \alpha \rightarrow 1/2^+. A rigorous per-block bias analysis reveals that re-tuning the block-growth parameter improves the batch-means rate to O(n^{-(1-\alpha)/3}), achieving O(n^{-1/6}). The modified estimator requires no Hessian access and preserves O(d^2) memory. We provide a complete error decomposition into variance, stationarity bias, and nonlinearity bias components. A weighted-averaging variant that avoids hard truncation is also discussed. We establish the minimax rate \Theta(n^{-(1-\alpha)/2}) for Hessian-free covariance estimation from the SGD trajectory: a Le Cam lower bound gives \Omega(n^{-(1-\alpha)/2}), and a trajectory-regression estimator--which estimates the Hessian by regressing SGD increments on iterates--achieves O(n^{-(1-\alpha)/2}), matching the lower bound. The construction reveals that the bottleneck is the sublinear accumulation of information about the Hessian from the SGD drift.
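
The trajectory-regression idea in the abstract can also be illustrated with a short sketch: since the rescaled SGD increment (x_t − x_{t+1})/η_t behaves approximately like H(x_t − x*) plus noise near the optimum, regressing increments on centred iterates yields a Hessian estimate, which can then be plugged into the sandwich formula H^{-1} G H^{-1}. The concrete choices below (centring at the running average, ridge regularisation, estimating the gradient-noise covariance G from the regression residuals, and storing the full trajectory offline for clarity) are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def trajectory_regression_covariance(xs, etas, ridge=1e-6):
    """Sketch of a trajectory-regression covariance estimator.

    xs   : array of SGD iterates, shape (n+1, d)
    etas : step sizes eta_t used at steps t = 1..n, shape (n,)

    The rescaled increment y_t = (x_t - x_{t+1}) / eta_t is regressed on the
    centred iterate x_t - xbar to estimate the Hessian H; the residuals give a
    gradient-noise covariance estimate G; the asymptotic covariance of the
    averaged iterate is then the sandwich H^{-1} G H^{-1}.
    """
    xbar = xs.mean(axis=0)
    X = xs[:-1] - xbar                       # centred iterates, shape (n, d)
    Y = (xs[:-1] - xs[1:]) / etas[:, None]   # rescaled increments, shape (n, d)

    d = X.shape[1]
    # multivariate ridge least squares: Y ~ X @ H^T, so B = (X^T X + ridge I)^{-1} X^T Y = H^T
    gram = X.T @ X + ridge * np.eye(d)
    H_hat = np.linalg.solve(gram, X.T @ Y).T

    # gradient-noise covariance from the regression residuals
    resid = Y - X @ H_hat.T
    G_hat = resid.T @ resid / len(resid)

    # plug-in sandwich estimate of the asymptotic covariance
    H_inv = np.linalg.inv(H_hat)
    return H_inv @ G_hat @ H_inv.T
```

This offline version stores the whole trajectory only for readability; the regression normal equations (the Gram matrix and the cross-product X^T Y) can be accumulated online in O(d^2) memory, which is consistent with the abstract's point that the rate-limiting factor is how slowly information about H accumulates from the SGD drift.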