The Spectral Edge Thesis: A Mathematical Framework for Intra-Signal Phase Transitions in Neural Network Training

arXiv cs.AI / 4/1/2026


Key Points

  • The paper proposes a “spectral edge thesis” stating that key phase-transition behaviors in neural network training—like grokking, capability gains, and loss plateaus—are governed by the spectral gap of a rolling-window Gram matrix of parameter updates.
  • In the extreme aspect-ratio setting, it argues that the usual BBP (Baik–Ben Arous–Péché) detection threshold no longer applies; instead, a specific intra-signal gap position k*, defined via the dominant-to-subdominant singular-value ratio, uniquely drives learning (see the sketch after this list).
  • The authors derive a gap-dynamics model via a Dyson-type ODE, plus a spectral decomposition linking learning contributions to Davis–Kahan stability, and introduce a “Gap Maximality Principle” where only the collapse of the privileged k* disrupts learning.
  • They define an adiabatic stability parameter (𝓐 = ||ΔG||_F / (η g²)) to classify training regimes: stable plateaus (𝓐 ≪ 1), phase transitions (𝓐 ~ 1), and forgetting (𝓐 ≫ 1).
  • Experiments across six model families (150K–124M parameters) report that gap dynamics precede every grokking event when weight decay is used (24/24 runs) but never without it (0/24), that the privileged gap position depends on the optimizer (e.g., Muon: k* = 1 vs. AdamW: k* = 2 on the same model), and that 19 of 20 quantitative predictions are confirmed.
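
To make the gap position concrete, here is a minimal numerical sketch of how k* could be computed from a rolling window of parameter updates. The paper does not publish reference code in this summary; the function name, the window size, and the route through the W×W Gram matrix eigendecomposition are illustrative assumptions.

```python
import numpy as np

def intra_signal_gap_position(updates, window=10, eps=1e-12):
    """Estimate k* = argmax_j sigma_j / sigma_{j+1} for the rolling-window
    Gram matrix of parameter updates (illustrative sketch, not the paper's code).

    updates : array of shape (T, P), one flattened parameter update per step
    window  : W, the rolling-window length (W << P in the extreme aspect-ratio regime)
    """
    U = updates[-window:]                    # last W updates, shape (W, P)
    G = U @ U.T                              # W x W Gram matrix (cheap, since W is tiny)
    # Singular values of U are the square roots of G's eigenvalues.
    eigvals = np.linalg.eigvalsh(G)[::-1]    # descending order
    sigma = np.sqrt(np.clip(eigvals, 0.0, None))
    ratios = sigma[:-1] / (sigma[1:] + eps)  # sigma_j / sigma_{j+1}
    k_star = int(np.argmax(ratios)) + 1      # 1-indexed gap position
    return k_star, float(ratios[k_star - 1]), sigma
```

Working through the W×W Gram matrix rather than a singular value decomposition of the full W×P update matrix keeps the cost independent of the parameter count P; in practice one might call this every few steps on the stacked, flattened optimizer updates and log k* alongside the loss.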

Abstract

We develop the spectral edge thesis: phase transitions in neural network training (grokking, capability gains, loss plateaus) are controlled by the spectral gap of the rolling-window Gram matrix of parameter updates. In the extreme aspect-ratio regime (parameters P ~ 10⁸, window W ~ 10), the classical BBP detection threshold is vacuous; the operative structure is the intra-signal gap separating dominant from subdominant modes at position k* = argmax_j σ_j/σ_{j+1}. From three axioms we derive: (i) gap dynamics governed by a Dyson-type ODE with curvature asymmetry, damping, and gradient driving; (ii) a spectral loss decomposition linking each mode's learning contribution to its Davis–Kahan stability coefficient; (iii) the Gap Maximality Principle, showing that k* is the unique dynamically privileged position: its collapse is the only one that disrupts learning, and it sustains itself through an α-feedback loop requiring no assumption on the optimizer. The adiabatic parameter 𝓐 = ||ΔG||_F / (η g²) controls circuit stability: 𝓐 ≪ 1 (plateau), 𝓐 ~ 1 (phase transition), 𝓐 ≫ 1 (forgetting). Tested across six model families (150K–124M parameters): gap dynamics precede every grokking event (24/24 with weight decay, 0/24 without), the gap position is optimizer-dependent (Muon: k* = 1, AdamW: k* = 2 on the same model), and 19/20 quantitative predictions are confirmed. The framework is consistent with the edge of stability, Tensor Programs, Dyson Brownian motion, the Lottery Ticket Hypothesis, and neural scaling laws.
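
As a rough illustration of how the adiabatic parameter might be tracked during training, the sketch below computes 𝓐 = ||ΔG||_F / (η g²) between consecutive windows and bins it into the three regimes named in the abstract. Reading g as the gap magnitude σ_{k*} − σ_{k*+1} at the privileged position, and the threshold values 0.1 and 10, are assumptions for illustration; the paper's precise operationalization may differ.

```python
import numpy as np

def classify_regime(G_prev, G_curr, eta, sigma, k_star, lo=0.1, hi=10.0, eps=1e-30):
    """Classify the training regime via A = ||Delta G||_F / (eta * g^2).

    g is taken here as the spectral gap at the privileged position k*,
    i.e. sigma_{k*} - sigma_{k*+1} (assumed reading); lo/hi are
    illustrative thresholds for the A << 1 and A >> 1 regimes.
    """
    delta_G = np.linalg.norm(G_curr - G_prev, ord="fro")  # ||Delta G||_F
    g = sigma[k_star - 1] - sigma[k_star]                 # gap magnitude at k* (1-indexed)
    A = delta_G / (eta * g**2 + eps)
    if A < lo:
        regime = "plateau (A << 1)"
    elif A > hi:
        regime = "forgetting (A >> 1)"
    else:
        regime = "phase transition (A ~ 1)"
    return A, regime
```

Combined with the earlier sketch, G_prev and G_curr would be the Gram matrices of two consecutive rolling windows, with sigma and k_star taken from the current one.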