Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

arXiv cs.LG / 4/16/2026


Key Points

  • The paper studies stochastic gradient descent (SGD) with momentum and shows that it can operate in an Edge of Stochastic Stability (EoSS)-like regime, in which training self-organizes near an instability boundary that shapes both the optimization dynamics and the solutions found.
  • It finds that “batch sharpness” (expected directional mini-batch curvature) depends strongly on batch size, exhibiting two distinct plateau behaviors rather than following a single momentum-adjusted stability threshold.
  • At small batch sizes, batch sharpness converges to a lower plateau, 2(1−β)/η, indicating that momentum amplifies stochastic fluctuations and biases training toward flatter regions than vanilla SGD.
  • At large batch sizes, batch sharpness converges to a higher plateau, 2(1+β)/η, where momentum returns to its classical stabilizing role and biases solutions toward sharper regions, consistent with full-batch dynamics.
  • The authors connect these observations to linear stability thresholds and discuss practical implications for hyperparameter tuning and how momentum and batch size should be coupled.
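
The "expected directional mini-batch curvature" in the second bullet can be made concrete with a small Monte-Carlo sketch. This is not the paper's code; it assumes the natural reading of batch sharpness as E_B[gᵀ_B H_B g_B / ‖g_B‖²], estimated here on a toy least-squares problem where the per-example Hessians are known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size = 256, 10, 8
A = rng.normal(size=(n, d))   # per-example features a_i
b = rng.normal(size=n)        # per-example targets
x = rng.normal(size=d)        # current parameters

def batch_sharpness(x, n_samples=500):
    """Monte-Carlo estimate of E_B[ g_B^T H_B g_B / ||g_B||^2 ] for
    per-example losses f_i(x) = 0.5 * (a_i @ x - b_i)**2, so that the
    mini-batch Hessian is H_B = A_B^T A_B / |B|."""
    vals = []
    for _ in range(n_samples):
        idx = rng.choice(n, size=batch_size, replace=False)
        A_B, b_B = A[idx], b[idx]
        r = A_B @ x - b_B                   # mini-batch residuals
        g = A_B.T @ r / batch_size          # mini-batch gradient g_B
        Hg = A_B.T @ (A_B @ g) / batch_size # Hessian-vector product H_B g_B
        vals.append((g @ Hg) / (g @ g))     # directional curvature along g_B
    return float(np.mean(vals))

print(batch_sharpness(x))  # a positive scalar: the batch-sharpness estimate
```

In a real network the same quantity would be estimated with Hessian-vector products (e.g. via double backprop) along the mini-batch gradient direction, averaged over sampled batches.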

Abstract

Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau, 2(1−β)/η, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau, 2(1+β)/η, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and for how momentum and batch size should be coupled.
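
The large-batch plateau 2(1+β)/η matches the classical linear stability threshold of heavy-ball momentum on a quadratic: iterates stay bounded only while the curvature λ satisfies λ < 2(1+β)/η. A minimal sketch (not from the paper, using a standard 1-D quadratic f(x) = λx²/2 and the common heavy-ball update) shows divergence exactly when that threshold is crossed:

```python
def heavy_ball(lam, eta=0.01, beta=0.9, steps=200, x0=1.0):
    """Full-batch heavy-ball momentum on f(x) = lam * x**2 / 2."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - eta * lam * x  # momentum buffer update
        x = x + v                     # parameter update
    return abs(x)

threshold = 2 * (1 + 0.9) / 0.01      # = 380.0: largest stable sharpness

print(heavy_ball(lam=370.0))  # below threshold: |x| decays toward 0
print(heavy_ball(lam=390.0))  # above threshold: |x| blows up
```

Note that the stable margin is wider than vanilla gradient descent's 2/η = 200, which is the sense in which momentum "favors sharper regions" at large batch sizes; the paper's contribution is that this threshold flips to the lower value 2(1−β)/η once mini-batch noise dominates.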