AI Navigate

Large Spikes in Stochastic Gradient Descent: A Large-Deviations View

arXiv cs.LG / 3/12/2026

Key Points

  • The paper provides a quantitative theory of the catapult phase in SGD training of a shallow network under NTK scaling.
  • It identifies an explicit criterion G, depending on the kernel, the learning rate, and the data, that separates two regimes of behavior: G > 0 yields large NTK-flattening spikes with high probability, while G < 0 makes the spike probability decay like (n/η)^{-ϑ/2} for an explicitly characterized ϑ ∈ (0, ∞) (restated symbolically after this list).
  • This yields a concrete, parameter-dependent explanation for why such spikes can still be observed at practical network widths.
  • The analysis employs a large-deviations viewpoint to characterize spike probabilities and relate kernel dynamics to training hyperparameters.
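
In symbols, using the abstract's notation and reading n as the network width (the summary leaves n implicit, so this is a gloss rather than a verbatim theorem statement):

```latex
% Regime dichotomy, paraphrased from the abstract; G depends only on
% the kernel, the learning rate \eta, and the data. n is read as the width.
G > 0 \;\Longrightarrow\; \text{large NTK-flattening spikes occur with high probability},
\qquad
G < 0 \;\Longrightarrow\; \mathbb{P}(\text{spike}) \;\sim\; \Big(\tfrac{n}{\eta}\Big)^{-\vartheta/2},
\quad \vartheta \in (0,\infty).
```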

Abstract

We analyse SGD training of a shallow, fully connected network in the NTK scaling and provide a quantitative theory of the catapult phase. We identify an explicit criterion separating two behaviours: when an explicit function G, depending only on the kernel, the learning rate η, and the data, is positive, SGD produces large NTK-flattening spikes with high probability; when G < 0, their probability decays like (n/η)^{-ϑ/2}, for an explicitly characterised ϑ ∈ (0, ∞). This yields a concrete, parameter-dependent explanation for why such spikes may still be observed at practical widths.
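
To make "NTK-flattening spike" concrete, here is a minimal NumPy toy in the spirit of the catapult literature: a two-layer linear model under NTK scaling fit to a single target, where the loss spikes and the scalar NTK flattens once the learning rate crosses the stability threshold. The model, the one-point dataset, and the thresholds in the comments are illustrative stand-ins, not the paper's construction (which treats SGD on a shallow fully connected network).

```python
import numpy as np

def catapult_demo(eta, m=512, steps=300, y=1.0, seed=0):
    """Toy catapult: f(u, v) = u.v / sqrt(m) fit to a single target y
    by gradient descent. The scalar NTK of this model is
    lam = (|u|^2 + |v|^2) / m; a spike flattens it when eta * lam > 2."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(m)
    v = rng.standard_normal(m)
    losses, ntks = [], []
    for _ in range(steps):
        err = u @ v / np.sqrt(m) - y          # residual f - y
        lam = (u @ u + v @ v) / m             # scalar NTK
        losses.append(0.5 * err ** 2)
        ntks.append(lam)
        if not np.isfinite(err) or losses[-1] > 1e12:
            break                             # diverged: stop early
        # dL/du = err * v / sqrt(m), dL/dv = err * u / sqrt(m)
        u, v = (u - eta * err * v / np.sqrt(m),
                v - eta * err * u / np.sqrt(m))
    return np.array(losses), np.array(ntks)

if __name__ == "__main__":
    # At init lam is roughly 2, so eta ~ 0.5 converges monotonically,
    # eta ~ 1.5 spikes and then recovers with a flatter NTK, and
    # eta ~ 2.5 diverges.
    for eta in (0.5, 1.5, 2.5):
        losses, ntks = catapult_demo(eta)
        print(f"eta={eta}: peak loss {losses.max():.3g}, "
              f"NTK {ntks[0]:.2f} -> {ntks[-1]:.2f}")
```

In the intermediate regime the residual first grows geometrically, then the O(1/m) feedback of the error onto the kernel kicks in and drives the NTK below the stability threshold 2/η, after which the loss converges: the qualitative spike-then-flatten behaviour the paper quantifies.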