AI Navigate

Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales

arXiv cs.AI / 3/18/2026


Key Points

  • The paper introduces Spectral Edge Dynamics (SED), a rolling-window SVD-based method to identify a sharp spectral edge that separates coherent optimization directions from stochastic noise in training trajectories.
  • Experiments on a 51M-parameter TinyStories model and GPT-2 124M under distribution shift reveal a universal three-phase pattern in the spectral edge: rise, plateau, and collapse.
  • The effective signal rank k* scales with task complexity (k* = 2 for 51M and k* = 3 for 124M), indicating how many directions dominate the training dynamics.
  • The coupling between spectral geometry and validation loss can reverse with window size, a lag flip that reflects the timescale of trajectory integration.
  • A Johnson–Lindenstrauss projection to d = 10W dimensions preserves the spectral gap within about 5.7%, enabling the framework to scale to models of arbitrary size; companion work uses this geometry to forecast grokking 600–1,700 steps in advance across several tasks.
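The edge-detection idea in the key points above can be sketched in a few lines. This is a minimal illustration under the assumption that the window of parameter updates is flattened into a W × P matrix (one row per optimizer step); `spectral_edge` is a hypothetical helper name, not from the paper's code:

```python
import numpy as np

def spectral_edge(updates):
    """Locate the spectral edge of a rolling window of updates.

    updates: (W, P) array -- W flattened parameter-update vectors.
    Returns (k_star, gap): the position of the largest consecutive
    singular-value ratio sigma_k / sigma_{k+1}, and that ratio.
    """
    s = np.linalg.svd(updates, compute_uv=False)  # at most W singular values
    ratios = s[:-1] / s[1:]                       # sigma_k / sigma_{k+1}
    k = int(np.argmax(ratios))                    # 0-based edge position
    return k + 1, float(ratios[k])                # 1-based k*, gap size

# Synthetic check: a rank-2 "signal" plus small isotropic noise
# should put the edge at k* = 2 with a large gap.
rng = np.random.default_rng(0)
W, P = 10, 1000
signal = rng.normal(size=(W, 2)) @ rng.normal(size=(2, P))
trajectory = signal + 0.01 * rng.normal(size=(W, P))
k_star, gap = spectral_edge(trajectory)
```

Because only W singular values exist for a W × P window, the whole search costs one thin SVD per window regardless of how the edge is used downstream.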

Abstract

Despite hundreds of millions of parameters, transformer training trajectories evolve within only a few coherent directions. We introduce *Spectral Edge Dynamics* (SED) to measure this structure: rolling-window SVD of parameter updates reveals a sharp boundary, the *spectral edge*, between coherent optimization directions and stochastic noise, identified by the maximum consecutive singular value ratio σ_k/σ_{k+1}. Across a 51M-parameter TinyStories model (4 seeds) and GPT-2 124M under a distribution shift, the spectral edge exhibits a universal three-phase pattern (rise, plateau, collapse), signal rank adjusts with task complexity (k* = 2 at 51M, k* = 3 at 124M), and the directional coupling between spectral geometry and validation loss reverses with window size, a *lag flip* reflecting the timescale of trajectory integration. Johnson–Lindenstrauss projection to d = 10W dimensions (e.g., d = 100 for W = 10) preserves the spectral gap within 5.7%, making the framework applicable to models of arbitrary size. In companion work, the same spectral geometry provides early-warning signals of grokking, predicting generalization 600–1,700 steps before it occurs across modular arithmetic, Dyck languages, and the SCAN benchmark.
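The abstract's Johnson–Lindenstrauss step can likewise be sketched. This is a minimal illustration assuming a dense Gaussian random projection (the paper does not specify its JL construction here); `jl_project` and `spectral_gap` are hypothetical names introduced for the example:

```python
import numpy as np

def jl_project(updates, d=None, seed=0):
    """Compress a (W, P) trajectory matrix to (W, d), with d = 10W by default."""
    W, P = updates.shape
    d = d or 10 * W                           # the paper's d = 10W rule
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(P, d)) / np.sqrt(d)  # norm-preserving in expectation
    return updates @ R

def spectral_gap(M):
    """Edge position and size: the largest consecutive singular-value ratio."""
    s = np.linalg.svd(M, compute_uv=False)
    ratios = s[:-1] / s[1:]
    return int(np.argmax(ratios)) + 1, float(ratios.max())

# The edge position survives projection from P = 5000 down to d = 100 dims.
rng = np.random.default_rng(1)
W, P = 10, 5000
X = (rng.normal(size=(W, 3)) @ rng.normal(size=(3, P))
     + 0.01 * rng.normal(size=(W, P)))
k_full, _ = spectral_gap(X)                     # edge in the full space
k_proj, gap_proj = spectral_gap(jl_project(X))  # edge after JL compression
```

A Gaussian map approximately preserves the pairwise geometry of the W window rows, so the edge position carries over after compression, while the per-window SVD cost drops from O(W²P) to O(W²d).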