The Origin of Edge of Stability

arXiv cs.LG / 4/23/2026


Key Points

  • The paper studies how full-batch gradient descent on neural networks drives the largest Hessian eigenvalue to a specific threshold, the “Edge of Stability” at 2/η, where η is the learning rate.
  • It proposes a new mathematical object, “edge coupling,” defined over consecutive iterate pairs, whose structure is uniquely determined by the gradient-descent update and whose criticality condition leads directly to the 2/η stability boundary.
  • By analyzing a step recurrence and a second-order expansion of the loss change, the authors show that a telescoping-sum effect forces the curvature (the largest Hessian eigenvalue) toward 2/η with no gap.
  • The work further characterizes fixed points and period-two orbits by setting both gradients of the edge coupling to zero, reducing the near-fixed-point dynamics to a function of the half-amplitude alone, which determines where period-two behavior appears relative to the critical learning rate.
  • The analysis uses the mean value theorem to bridge different Hessian averages to the true Hessian evaluated at an interior point along the step segment, enabling an exact, not approximate, forcing result.

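The 2/η stability boundary in the first bullet can already be seen in a quadratic model, where the Hessian is a constant a: a gradient-descent step multiplies the iterate by (1 − ηa), which contracts for a < 2/η, flips sign with preserved magnitude (a period-two orbit) exactly at a = 2/η, and diverges beyond it. A minimal sketch (function name and constants are my own, not from the paper):

```python
def gd_quadratic(a, eta, x0=1.0, steps=200):
    """Run gradient descent on f(x) = a*x^2/2, whose Hessian is the constant a."""
    x = x0
    for _ in range(steps):
        x = x - eta * a * x  # x_{t+1} = (1 - eta*a) * x_t
    return x

eta = 0.1
print(abs(gd_quadratic(1.9 / eta, eta)))  # a < 2/eta: iterate contracts toward 0
print(abs(gd_quadratic(2.0 / eta, eta)))  # a = 2/eta: period-two, |x| preserved at 1
print(abs(gd_quadratic(2.1 / eta, eta)))  # a > 2/eta: iterate diverges
```

In a neural network the curvature is not constant, which is why the paper needs the edge-coupling machinery to show the trajectory is actively forced toward this boundary rather than merely stable near it.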
Abstract

Full-batch gradient descent on neural networks drives the largest Hessian eigenvalue to the threshold 2/η, where η is the learning rate. This phenomenon, the Edge of Stability, has resisted a unified explanation: existing accounts establish self-regulation near the edge but do not explain why the trajectory is forced toward 2/η from arbitrary initialization. We introduce the edge coupling, a functional on consecutive iterate pairs whose coefficient is uniquely fixed by the gradient-descent update. Differencing its criticality condition yields a step recurrence with stability boundary 2/η, and a second-order expansion yields a loss-change formula whose telescoping sum forces curvature toward 2/η. The two formulas involve different Hessian averages, but the mean value theorem localizes each to the true Hessian at an interior point of the step segment, yielding exact forcing of the Hessian eigenvalue with no gap. Setting both gradients of the edge coupling to zero classifies fixed points and period-two orbits; near a fixed point, the problem reduces to a function of the half-amplitude alone, which determines which directions support period-two orbits and on which side of the critical learning rate they appear.
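The mean value theorem step in the abstract can be checked numerically in one dimension. Writing the loss change over a single step as Δf = −ηg²(1 − ηH̄/2) defines an average curvature H̄; Taylor's theorem with the Lagrange remainder guarantees that H̄ equals the true second derivative f″(ξ) at some interior point ξ of the step segment, exactly rather than approximately. A hedged sketch on a toy quartic loss (my choice of function, not the paper's):

```python
def f(x):   return x**4 / 4.0   # toy loss with position-dependent curvature
def df(x):  return x**3          # gradient
def d2f(x): return 3.0 * x**2    # Hessian (second derivative)

eta, x = 0.1, 1.0
g = df(x)
y = x - eta * g                  # one full-batch gradient-descent step

# Loss change written as  delta = -eta*g**2 * (1 - eta*H_bar/2)  defines H_bar.
delta = f(y) - f(x)
H_bar = 2.0 * (delta + eta * g**2) / (eta * g) ** 2

# H_bar is a true curvature value attained on the step segment [y, x] ...
lo, hi = sorted((d2f(y), d2f(x)))
assert lo <= H_bar <= hi

# ... at an interior point xi with f''(xi) = H_bar (solve 3*xi**2 = H_bar).
xi = (H_bar / 3.0) ** 0.5
assert min(x, y) < xi < max(x, y)

print(H_bar, xi)
```

Because H̄ is the literal Hessian at a point on the segment, comparing it to 2/η decides the sign of the loss change exactly, which is the sense in which the paper's forcing argument has "no gap."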