SGD at the Edge of Stability: The Stochastic Sharpness Gap

arXiv cs.LG / 4/24/2026


Key Points

  • The paper studies how sharpness (the Hessian’s largest eigenvalue) behaves during neural network training, showing that full-batch gradient descent reaches an “Edge of Stability” where sharpness hovers near 2/η.
  • Prior work explained this GD edge behavior via a self-stabilization mechanism tied to third-order loss structure, and the new work extends the framework to mini-batch stochastic gradient descent (SGD).
  • The authors argue that gradient noise in SGD boosts the stabilizing, cubic sharpness-reducing force along the top Hessian eigenvector, pushing the sharpness equilibrium below 2/η.
  • They introduce a stochastic predicted-dynamics approach and prove a coupling theorem that bounds how SGD deviates from these predictions.
  • The resulting closed-form equilibrium gap, ΔS = ηβσ_u^2/(4α), implies that smaller batch sizes produce flatter solutions, while recovering the GD edge when SGD uses the full dataset.
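As a quick numerical illustration (not drawn from the paper's experiments), the closed-form gap ΔS = ηβσ_u²/(4α) can be evaluated under the common sampling-without-replacement assumption that the projected noise variance scales as σ_u² ∝ (N − B)/(N·B) for batch size B and dataset size N; that scaling is an assumption for illustration, as are all the constants below:

```python
def sharpness_gap(eta, alpha, beta, sigma_u_sq):
    """Equilibrium gap below 2/eta predicted by the paper's formula
    Delta_S = eta * beta * sigma_u^2 / (4 * alpha)."""
    return eta * beta * sigma_u_sq / (4 * alpha)

def noise_variance(sigma0_sq, batch_size, dataset_size):
    # Illustrative assumption: sampling-without-replacement scaling.
    # The variance vanishes when batch_size == dataset_size, so the
    # formula recovers full-batch GD (sharpness at exactly 2/eta).
    return sigma0_sq * (dataset_size - batch_size) / (dataset_size * batch_size)

# Hypothetical constants: step size, sharpening rate alpha,
# stabilization strength beta, per-example noise scale, dataset size.
eta, alpha, beta, sigma0_sq, N = 0.01, 1.0, 1.0, 1.0, 1024

for B in (32, 128, 512, 1024):
    gap = sharpness_gap(eta, alpha, beta, noise_variance(sigma0_sq, B, N))
    print(f"batch={B:4d}  equilibrium sharpness ~ {2 / eta - gap:.6f}")
```

Under this scaling, the gap shrinks monotonically as B grows and hits zero at B = N, matching the paper's claim that smaller batches settle at flatter solutions.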

Abstract

When training neural networks with full-batch gradient descent (GD) and step size η, the largest eigenvalue of the Hessian, the sharpness S(θ), rises to 2/η and hovers there, a phenomenon termed the Edge of Stability (EoS). Damian et al. (2023) showed that this behavior is explained by a self-stabilization mechanism driven by third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint S(θ) ≤ 2/η. For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below 2/η, with the gap widening as the batch size decreases; yet no theoretical explanation exists for this suppression. We introduce stochastic self-stabilization, extending the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below 2/η. Following the approach of Damian et al. (2023), we define stochastic predicted dynamics relative to a moving projected gradient descent trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap: ΔS = ηβσ_u²/(4α), where α is the progressive sharpening rate, β is the self-stabilization strength, and σ_u² is the gradient noise variance projected onto the top eigenvector. This formula predicts that smaller batch sizes yield flatter solutions and recovers GD when the batch equals the full dataset.
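The quantity at the center of all of this, the sharpness S(θ), is the largest Hessian eigenvalue, and in practice it is typically estimated by power iteration on Hessian-vector products rather than by forming the Hessian. A minimal NumPy sketch on a toy quadratic loss, where the exact answer is known (the matrix and iteration count are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy loss L(theta) = 0.5 * theta^T H theta with a known symmetric Hessian,
# so the exact sharpness is H's largest eigenvalue.
A = rng.standard_normal((5, 5))
H = A @ A.T  # symmetric positive semi-definite

def hessian_vector_product(v):
    # For the quadratic loss above this is just H @ v; for a neural
    # network it would be computed via double backpropagation.
    return H @ v

def sharpness(hvp, dim, iters=300):
    # Power iteration: repeatedly apply the Hessian and renormalize;
    # the Rayleigh quotient v^T H v converges to the top eigenvalue.
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hvp(v)
        v = w / np.linalg.norm(w)
    return v @ hvp(v)

print("power-iteration estimate:", sharpness(hessian_vector_product, 5))
print("exact largest eigenvalue:", np.linalg.eigvalsh(H)[-1])
```

At the Edge of Stability, tracking this estimate over training is what reveals the hovering near 2/η for GD, and the suppressed equilibrium below it for SGD.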