Stochastic Gradient Descent in the Saddle-to-Saddle Regime of Deep Linear Networks

arXiv cs.LG / 4/9/2026


Key Points

  • The paper studies how stochastic gradient descent (SGD) noise affects saddle-to-saddle training dynamics in deep linear networks (DLNs), an analytically tractable stand-in for deep neural networks.
  • It models SGD as stochastic Langevin dynamics with anisotropic, state-dependent noise and—under aligned and balanced weight assumptions—decomposes training into independent one-dimensional per-mode stochastic differential equations.
  • The analysis shows that the strongest diffusion in a given mode occurs before the corresponding feature is fully learned, linking SGD noise patterns to the timing of feature learning.
  • It derives the stationary distribution per mode, finding that without label noise it matches the gradient-flow stationary behavior, while with label noise it approaches a Boltzmann-like distribution.
  • Experiments indicate the qualitative behavior persists even when aligned or balanced weight conditions are not strictly satisfied, suggesting the conclusions are robust to more general setups.
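The per-mode picture above can be sketched numerically. The snippet below integrates, with the Euler–Maruyama scheme, a one-dimensional SDE of the form ds = L·s^(2−2/L)(σ − s) dt + g(s) dW, where the drift is the standard balanced gradient-flow mode dynamics for a depth-L linear network and the state-dependent noise amplitude g(s) ∝ s^(1−1/L)|σ − s| is an illustrative assumption, not the paper's exact noise model. With this choice, the diffusion amplitude peaks strictly before the mode saturates at the target singular value σ:

```python
import numpy as np

rng = np.random.default_rng(0)

L = 2          # network depth
sigma = 1.0    # target singular value for this mode
eps = 0.02     # noise scale (hypothetical stand-in for SGD noise strength)
dt = 1e-3
steps = 20_000

s = 1e-3       # the mode starts near the saddle at zero
traj, diff = [], []
for _ in range(steps):
    drift = L * s ** (2 - 2 / L) * (sigma - s)    # balanced gradient-flow drift
    g = eps * s ** (1 - 1 / L) * abs(sigma - s)   # assumed state-dependent noise amplitude
    s += drift * dt + g * np.sqrt(dt) * rng.standard_normal()
    s = max(s, 0.0)
    traj.append(s)
    diff.append(g)

traj, diff = np.array(traj), np.array(diff)
t_max_diff = int(np.argmax(diff))           # step of maximal diffusion
t_learned = int(np.argmax(traj > 0.99 * sigma))  # step at which the mode is ~learned
print(traj[-1], t_max_diff < t_learned)
```

Under these assumptions the diffusion amplitude is maximized around s ≈ σ/3, i.e. while the feature is still being learned, mirroring the qualitative claim that maximal diffusion precedes feature learning.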

Abstract

Deep linear networks (DLNs) are used as an analytically tractable model of the training dynamics of deep neural networks. While gradient descent in DLNs is known to exhibit saddle-to-saddle dynamics, the impact of stochastic gradient descent (SGD) noise on this regime remains poorly understood. We investigate the dynamics of SGD during training of DLNs in the saddle-to-saddle regime. We model the training dynamics as stochastic Langevin dynamics with anisotropic, state-dependent noise. Under the assumption of aligned and balanced weights, we derive an exact decomposition of the dynamics into a system of one-dimensional per-mode stochastic differential equations. This establishes that the maximal diffusion along a mode precedes the corresponding feature being completely learned. We also derive the stationary distribution of SGD for each mode: in the absence of label noise, its marginal distribution along specific features coincides with the stationary distribution of gradient flow, while in the presence of label noise it approximates a Boltzmann distribution. Finally, we confirm experimentally that the theoretical results hold qualitatively even without aligned or balanced weights. These results establish that SGD noise encodes information about the progression of feature learning but does not fundamentally alter the saddle-to-saddle dynamics.
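The label-noise case can also be checked numerically. If label noise leaves a constant noise floor g at the solution (an assumed form for illustration), then near s = σ the linearized drift is −k(s − σ) with k = L·σ^(2−2/L), the dynamics look like an Ornstein–Uhlenbeck process, and the Boltzmann-like stationary law is approximately Gaussian with variance g²/(2k):

```python
import numpy as np

rng = np.random.default_rng(1)

L, sigma = 2, 1.0
g = 0.05                       # constant noise floor (assumed stand-in for label noise)
dt, steps, burn = 1e-3, 300_000, 50_000

s = sigma                      # start at the learned solution
samples = []
for i in range(steps):
    drift = L * s ** (2 - 2 / L) * (sigma - s)   # same per-mode drift as gradient flow
    s += drift * dt + g * np.sqrt(dt) * rng.standard_normal()
    if i >= burn:
        samples.append(s)

samples = np.array(samples)
k = L * sigma ** (2 - 2 / L)   # curvature of the drift at s = sigma
var_pred = g ** 2 / (2 * k)    # Boltzmann (OU) stationary variance
print(samples.mean(), samples.var(), var_pred)
```

The empirical variance of the long-run samples should land close to the Boltzmann prediction g²/(2k), whereas in the noiseless-label sketch above the noise amplitude vanishes at s = σ and the mode simply concentrates at the gradient-flow fixed point.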