Online Covariance Estimation in Averaged SGD: Improved Batch-Mean Rates and Minimax Optimality via Trajectory Regression

arXiv cs.LG / 4/14/2026


Key Points

  • The paper studies online covariance matrix estimation for Polyak–Ruppert averaged SGD and analyzes the convergence of the existing batch-means estimator in operator norm.
  • By conducting a rigorous per-block bias analysis and re-tuning the block-growth parameter, the authors improve the batch-means rate from O(n^{-(1-α)/4}) to O(n^{-(1-α)/3}), i.e. from O(n^{-1/8}) to O(n^{-1/6}) at the optimal learning-rate exponent α → 1/2^+.
  • The modified estimator remains Hessian-free (no Hessian access) and preserves O(d^2) memory; the analysis provides a full error decomposition into variance, stationarity bias, and nonlinearity bias (see the sketch after this list).
  • The work also introduces a weighted-averaging variant that avoids hard truncation, and establishes the minimax rate Θ(n^{-(1-α)/2}) for Hessian-free estimation via a Le Cam lower bound together with a matching trajectory-regression estimator that recovers the Hessian from the drift of the SGD trajectory.
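
For intuition, here is a minimal sketch of averaged SGD with an online batch-means covariance estimator of the kind described above. The block-boundary rule a_k = ⌊c·k^β⌋, the step-size schedule, and the final normalisation are illustrative assumptions rather than the paper's exact tuning; only O(d^2) accumulators are kept, with the block-mean centring expanded at the end.

```python
import numpy as np

def averaged_sgd_batch_means(grad, x0, n, alpha=0.501, eta0=0.5, c=1.0, beta=2.0):
    """Sketch: Polyak-Ruppert averaged SGD with an online batch-means
    covariance estimator.

    Assumptions (not taken from the paper): block boundaries a_k = floor(c * k**beta),
    step size eta_t = eta0 * t**(-alpha), and normalisation by the total block length.
    `grad(x)` returns a single stochastic gradient at x.
    """
    d = x0.size
    x, xbar = x0.copy(), np.zeros(d)

    # O(d^2) running accumulators -- no per-block storage, so the centring term
    # can be expanded at the end:
    #   sum_k (S_k - l_k xbar)(S_k - l_k xbar)^T
    #     = A - xbar b^T - b xbar^T + L2 * xbar xbar^T
    A = np.zeros((d, d))   # sum_k S_k S_k^T over closed blocks
    b = np.zeros(d)        # sum_k l_k S_k
    L1 = 0.0               # sum_k l_k
    L2 = 0.0               # sum_k l_k^2

    S, length = np.zeros(d), 0          # current (open) block
    k = 1
    next_boundary = int(c * k ** beta)

    for t in range(1, n + 1):
        eta = eta0 * t ** (-alpha)      # polynomially decaying step size
        x = x - eta * grad(x)
        xbar += (x - xbar) / t          # Polyak-Ruppert average
        S += x
        length += 1
        if t >= next_boundary:          # close the current block, open a new one
            A += np.outer(S, S); b += length * S
            L1 += length; L2 += length ** 2
            S, length = np.zeros(d), 0
            k += 1
            next_boundary = int(c * k ** beta)

    if length > 0:                      # flush the last partial block
        A += np.outer(S, S); b += length * S
        L1 += length; L2 += length ** 2

    centred = A - np.outer(xbar, b) - np.outer(b, xbar) + L2 * np.outer(xbar, xbar)
    return xbar, centred / L1
```

Expanding the centring term is what keeps the estimator fully online with O(d^2) memory, matching the memory claim in the key points; only the growing-block schedule (the `beta` exponent here) has to be tuned to trade variance against the per-block bias.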

Abstract

We study online covariance matrix estimation for Polyak--Ruppert averaged stochastic gradient descent (SGD). The online batch-means estimator of Zhu, Chen and Wu (2023) achieves an operator-norm convergence rate of O(n^{-(1-\alpha)/4}), which yields O(n^{-1/8}) at the optimal learning-rate exponent \alpha \rightarrow 1/2^+. A rigorous per-block bias analysis reveals that re-tuning the block-growth parameter improves the batch-means rate to O(n^{-(1-\alpha)/3}), achieving O(n^{-1/6}). The modified estimator requires no Hessian access and preserves O(d^2) memory. We provide a complete error decomposition into variance, stationarity bias, and nonlinearity bias components. A weighted-averaging variant that avoids hard truncation is also discussed. We establish the minimax rate \Theta(n^{-(1-\alpha)/2}) for Hessian-free covariance estimation from the SGD trajectory: a Le Cam lower bound gives \Omega(n^{-(1-\alpha)/2}), and a trajectory-regression estimator--which estimates the Hessian by regressing SGD increments on iterates--achieves O(n^{-(1-\alpha)/2}), matching the lower bound. The construction reveals that the bottleneck is the sublinear accumulation of information about the Hessian from the SGD drift.
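
The trajectory-regression idea in the abstract can also be illustrated with a short sketch: since the rescaled SGD increment (x_t − x_{t+1})/η_t behaves approximately like H(x_t − x*) plus noise near the optimum, regressing increments on centred iterates yields a Hessian estimate, which can then be plugged into the sandwich formula H^{-1} G H^{-1}. The concrete choices below (centring at the running average, ridge regularisation, estimating the gradient-noise covariance G from the regression residuals, and storing the full trajectory offline for clarity) are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def trajectory_regression_covariance(xs, etas, ridge=1e-6):
    """Sketch of a trajectory-regression covariance estimator.

    xs   : array of SGD iterates, shape (n+1, d)
    etas : step sizes eta_t used at steps t = 1..n, shape (n,)

    The rescaled increment y_t = (x_t - x_{t+1}) / eta_t is regressed on the
    centred iterate x_t - xbar to estimate the Hessian H; the residuals give a
    gradient-noise covariance estimate G; the asymptotic covariance of the
    averaged iterate is then the sandwich H^{-1} G H^{-1}.
    """
    xbar = xs.mean(axis=0)
    X = xs[:-1] - xbar                       # centred iterates, shape (n, d)
    Y = (xs[:-1] - xs[1:]) / etas[:, None]   # rescaled increments, shape (n, d)

    d = X.shape[1]
    # multivariate ridge least squares: Y ~ X @ H^T, so B = (X^T X + ridge I)^{-1} X^T Y = H^T
    gram = X.T @ X + ridge * np.eye(d)
    H_hat = np.linalg.solve(gram, X.T @ Y).T

    # gradient-noise covariance from the regression residuals
    resid = Y - X @ H_hat.T
    G_hat = resid.T @ resid / len(resid)

    # plug-in sandwich estimate of the asymptotic covariance
    H_inv = np.linalg.inv(H_hat)
    return H_inv @ G_hat @ H_inv.T
```

This offline version stores the whole trajectory only for readability; the regression normal equations (the Gram matrix and the cross-product X^T Y) can be accumulated online in O(d^2) memory, which is consistent with the abstract's point that the rate-limiting factor is how slowly information about H accumulates from the SGD drift.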