Stochastic Gradient Descent in the Saddle-to-Saddle Regime of Deep Linear Networks
arXiv cs.LG / 4/9/2026
Key Points
- The paper studies how stochastic gradient descent (SGD) noise affects saddle-to-saddle training dynamics in deep linear networks (DLNs), an analytically tractable stand-in for deep neural networks.
- It models SGD as Langevin dynamics with anisotropic, state-dependent noise and, under aligned and balanced weight assumptions, decomposes training into independent one-dimensional stochastic differential equations (SDEs), one per mode.
- The analysis shows that the strongest diffusion in a given mode occurs before the corresponding feature is fully learned, linking SGD noise patterns to the timing of feature learning.
- It derives the stationary distribution per mode, finding that without label noise it matches the gradient-flow stationary behavior, while with label noise it approaches a Boltzmann-like distribution.
- Experiments indicate the qualitative behavior persists even when aligned or balanced weight conditions are not strictly satisfied, suggesting the conclusions are robust to more general setups.
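The Boltzmann-like stationary behavior in the fourth point follows from the standard zero-flux solution of the one-dimensional Fokker–Planck equation. As a sketch (the drift \(f\) and diffusion \(D\) below are generic placeholders, not the paper's exact per-mode expressions):

```latex
% Generic one-dimensional per-mode SDE:
dm_t = f(m_t)\,dt + \sqrt{2D(m_t)}\,dW_t
% Zero-flux stationary solution of the associated Fokker--Planck equation:
p_\infty(m) \;\propto\; \frac{1}{D(m)} \exp\!\left(\int^{m} \frac{f(u)}{D(u)}\,du\right)
% With approximately constant diffusion D (as label noise can induce) and
% gradient drift f = -V'(m) for a potential V, this reduces to a
% Boltzmann distribution:
p_\infty(m) \;\propto\; \exp\!\left(-\frac{V(m)}{D}\right)
```

Without label noise the diffusion vanishes at the minima, which is consistent with the stationary behavior collapsing onto that of gradient flow.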
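The per-mode picture can be illustrated with a small Euler–Maruyama simulation. This is a stylized sketch, not the paper's model: the drift term is the known gradient-flow mode dynamics of balanced deep linear networks, while the diffusion term `sigma * m**(1 - 1/L) * |s - m|` is a hypothetical stand-in for the anisotropic, state-dependent SGD noise, chosen so that noise vanishes both at the saddle (`m = 0`) and at the solution (`m = s`).

```python
import numpy as np

def simulate_mode(s=1.0, L=3, sigma=0.02, m0=1e-3, dt=1e-3, steps=20_000, seed=0):
    """Euler-Maruyama simulation of a stylized per-mode SDE for a depth-L
    deep linear network learning a target singular value s.

    Drift L * m**(2 - 2/L) * (s - m) follows the balanced gradient-flow
    mode dynamics; the state-dependent diffusion is a hypothetical
    stand-in for SGD noise, largest in mid-transition.
    """
    rng = np.random.default_rng(seed)
    m = np.empty(steps + 1)
    m[0] = m0
    for t in range(steps):
        drift = L * m[t] ** (2 - 2 / L) * (s - m[t])
        diffusion = sigma * m[t] ** (1 - 1 / L) * abs(s - m[t])
        step = drift * dt + diffusion * np.sqrt(dt) * rng.standard_normal()
        m[t + 1] = abs(m[t] + step)  # reflect at zero to keep the mode non-negative
    return m

traj = simulate_mode()
```

With a small initialization the trajectory shows the saddle-to-saddle signature: a long plateau near zero followed by a rapid transition to `s`. Under this toy noise model the diffusion coefficient `m**(2/3) * (1 - m)` (for `L = 3`, `s = 1`) peaks at `m = 0.4`, i.e. before the mode is fully learned, matching the qualitative claim in the third point.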