Saddle-To-Saddle Dynamics in Deep ReLU Networks: Low-Rank Bias in the First Saddle Escape

arXiv stat.ML · April 21, 2026

Key Points

  • The paper analyzes how gradient descent initially moves away from the “origin saddle” in parameter space for deep ReLU networks initialized with small weights.
  • It characterizes the so-called “escape directions” along which GD leaves the origin, which play a role analogous to that of the Hessian’s eigenvectors at a strict saddle.
  • The main theoretical result shows that the optimal escape direction exhibits a low-rank bias in deeper layers: the top singular value of the $\ell$-th layer weight matrix exceeds every other singular value by a factor of at least $\ell^{1/4}$ (probed numerically in the sketch after this list).
  • The authors further prove related properties of these escape directions and argue this supports a saddle-to-saddle dynamic where GD transitions through a sequence of saddles with increasing bottleneck rank.
  • Overall, the work provides a mechanistic explanation linking early optimization behavior to structured (low-rank) representations emerging across layers.
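
The claimed singular-value gap is easy to probe numerically. Below is a minimal NumPy sketch, not the authors' code: it runs plain GD on a small deep ReLU network from a tiny initialization and prints the per-layer ratio σ₁/σ₂ of the hidden weight matrices as training escapes the origin. Depth, width, init scale, learning rate, step count, and the toy regression task are all illustrative assumptions; under the paper's prediction, the ratio should grow in the deeper layers during the first escape.

```python
# Minimal sketch (assumed setup, not the paper's code): gradient descent on a
# deep ReLU network from a small initialization, tracking the singular-value
# gap sigma_1 / sigma_2 of each hidden weight matrix during the escape.
import numpy as np

rng = np.random.default_rng(0)

L, width, n, d = 4, 16, 64, 8        # depth, width, samples, input dim (illustrative)
scale = 0.1                           # small init: GD starts near the origin saddle
X = rng.standard_normal((n, d))
y = rng.standard_normal((n, 1))       # toy regression targets

sizes = [d] + [width] * (L - 1) + [1]
Ws = [scale * rng.standard_normal((sizes[i], sizes[i + 1])) for i in range(L)]

def forward(X, Ws):
    """Activations of every layer; ReLU on all layers except the last."""
    acts = [X]
    for i, W in enumerate(Ws):
        z = acts[-1] @ W
        acts.append(np.maximum(z, 0.0) if i < len(Ws) - 1 else z)
    return acts

def grads(X, y, Ws):
    """Backprop for the squared loss 0.5 * mean (f(X) - y)^2."""
    acts = forward(X, Ws)
    delta = (acts[-1] - y) / len(X)
    gs = [None] * len(Ws)
    for i in reversed(range(len(Ws))):
        gs[i] = acts[i].T @ delta
        if i > 0:
            delta = (delta @ Ws[i].T) * (acts[i] > 0)  # ReLU subgradient
    return gs

lr, steps = 0.2, 6000
for t in range(steps + 1):
    if t % 500 == 0:
        loss = 0.5 * np.mean((forward(X, Ws)[-1] - y) ** 2)
        svs = [np.linalg.svd(W, compute_uv=False) for W in Ws[1:-1]]
        ratios = [s[0] / max(s[1], 1e-12) for s in svs]
        print(f"step {t:5d}  loss {loss:.4f}  sigma1/sigma2:",
              ["%.1f" % r for r in ratios])
    gs = grads(X, y, Ws)
    Ws = [W - lr * g for W, g in zip(Ws, gs)]
```

Plain NumPy backprop keeps the sketch self-contained; slower escapes may simply need more steps or a slightly larger init scale, and only the square hidden matrices are monitored since the bound concerns the deeper layers.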

Abstract

When a deep ReLU network is initialized with small weights, gradient descent (GD) is at first dominated by the saddle at the origin in parameter space. We study the so-called escape directions along which GD leaves the origin, which play a similar role as the eigenvectors of the Hessian for strict saddles. We show that the optimal escape direction features a low-rank bias in its deeper layers: the first singular value of the $\ell$-th layer weight matrix is at least a factor of $\ell^{1/4}$ larger than any other singular value. We also prove a number of related results about these escape directions. We suggest that deep ReLU networks exhibit saddle-to-saddle dynamics, with GD visiting a sequence of saddles with increasing bottleneck rank (Jacot, 2023).
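
Stated more explicitly, the abstract's bound says the following (the symbols $W_\ell$ for the $\ell$-th layer of the escape direction and $\sigma_i$ for its singular values are my notation, not necessarily the paper's):

```latex
% Low-rank bias of the optimal escape direction: the top singular value of the
% ell-th layer dominates all others by a depth-dependent factor (notation assumed).
\[
  \sigma_1(W_\ell) \;\ge\; \ell^{1/4}\, \sigma_i(W_\ell)
  \qquad \text{for all } i \ge 2,
\]
% so deeper layers (larger ell) are pushed toward approximately rank-one matrices.
```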