Abstract
Although transformers have hundreds of millions of parameters, their training trajectories evolve within only a few coherent directions. We introduce \emph{Spectral Edge Dynamics} (SED) to measure this structure: a rolling-window SVD of parameter updates reveals a sharp boundary -- the \emph{spectral edge} -- between coherent optimization directions and stochastic noise, identified by the maximum consecutive singular-value ratio $\sigma_k/\sigma_{k+1}$. Across a 51M-parameter TinyStories model (4~seeds) and GPT-2 124M under a distribution shift, the spectral edge exhibits a universal three-phase pattern (rise, plateau, collapse); the signal rank adjusts with task complexity ($k^* = 2$ at 51M, $k^* = 3$ at 124M); and the directional coupling between spectral geometry and validation loss reverses with window size -- a \emph{lag flip} reflecting the timescale of trajectory integration. Johnson--Lindenstrauss projection to $d = 10W$ dimensions, where $W$ is the window length (e.g., $d = 100$ for $W = 10$), preserves the spectral gap to within 5.7\%, making the framework applicable to models of arbitrary size. In companion work, the same spectral geometry provides early-warning signals of grokking, predicting generalization 600--1{,}700 steps before it occurs across modular arithmetic, Dyck languages, and the SCAN benchmark.
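As a minimal illustration of the mechanics the abstract summarizes -- assuming flattened per-step update vectors and a dense Gaussian Johnson--Lindenstrauss map, with all function names (\texttt{spectral\_edge}, \texttt{jl\_project}) hypothetical rather than the paper's released code -- the spectral edge and its sketched counterpart can be computed as follows:

\begin{verbatim}
import numpy as np

def spectral_edge(updates, window=10):
    """Rolling-window SVD of parameter updates; the spectral edge k*
    is the index maximizing the consecutive singular-value ratio
    sigma_k / sigma_{k+1}.

    updates: (T, P) array, one flattened update vector per step.
    Returns a list of (k_star, gap_ratio) pairs, one per window.
    """
    edges = []
    for t in range(window, updates.shape[0] + 1):
        slab = updates[t - window:t]               # (window, P) slab
        s = np.linalg.svd(slab, compute_uv=False)  # descending sigmas
        ratios = s[:-1] / s[1:]                    # sigma_k / sigma_{k+1}
        k_star = int(np.argmax(ratios)) + 1        # 1-indexed edge
        edges.append((k_star, float(ratios[k_star - 1])))
    return edges

def jl_project(updates, window=10, seed=0):
    """Johnson--Lindenstrauss sketch to d = 10 * window dimensions,
    the assumed reading of d = 10W (W = rolling-window length)."""
    rng = np.random.default_rng(seed)
    d = 10 * window
    R = rng.standard_normal((updates.shape[1], d)) / np.sqrt(d)
    return updates @ R                             # (T, d) sketch

# Toy check: 2 coherent directions + isotropic noise in a 2000-dim
# "model"; the detector should report k* = 2 before and after sketching.
rng = np.random.default_rng(1)
T, P = 40, 2000
basis = np.linalg.qr(rng.standard_normal((P, 2)))[0]   # signal dirs
updates = rng.standard_normal((T, 2)) @ basis.T * 5.0  # coherent part
updates += rng.standard_normal((T, P)) * 0.02          # noise floor
print("full:    ", spectral_edge(updates)[-1])
print("sketched:", spectral_edge(jl_project(updates))[-1])
\end{verbatim}

On this toy trajectory the detector reports $k^* = 2$ both at full dimension and after the $d = 10W$ sketch, mirroring the gap-preservation claim above.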