The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

arXiv cs.LG · April 28, 2026


Key Points

  • The study systematically tracks full SVD singular-value spectra of every transformer weight matrix at 25-step intervals during pretraining across three scales (30M–285M parameters), revealing how spectral properties evolve over training time.
  • It finds transient compression waves: stable-rank compression propagates as a traveling wave across layers, producing an early peak in the depth-wise compression gradient that later reverses as late layers over-compress past early layers.
  • It observes persistent spectral gradients: the power-law exponent α forms a depth-dependent inverted-U pattern that shifts toward earlier layers as model depth increases, and this is distinct from the transient compression behavior.
  • The authors report a functional asymmetry in Q/K–V projections, with value/output projections compressing uniformly while query/key projections carry the full depth-dependent dynamics.
  • They formalize the results with a two-timescale dynamical model, derive scaling laws (Δα ∝ L^0.26, R^2=0.99), show α correlates with layer importance (ρ=0.69–0.84), and demonstrate spectral-guided pruning beats Last-N heuristics by 1.1×–3.6× across multiple model families.
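
The two core quantities above — stable rank (the compression measure) and the power-law exponent α of the weight spectrum — can be sketched from a single SVD. This is a minimal illustration, not the paper's code: the stable-rank formula (‖W‖²_F / σ²_max) is standard, but the α estimator here is a generic maximum-likelihood (Hill-style) tail fit on the eigenvalues λᵢ = σᵢ², since the summary does not specify the authors' estimator; `tail_frac` is our hypothetical knob.

```python
import numpy as np

def spectral_stats(W, tail_frac=0.5):
    """Singular-value spectrum, stable rank, and a rough power-law
    exponent alpha for a weight matrix W (illustrative only)."""
    # Full singular-value spectrum, descending -- the object tracked
    # every 25 steps in the study.
    s = np.linalg.svd(W, compute_uv=False)
    # Stable rank: ||W||_F^2 / sigma_max^2; low values = compressed.
    stable_rank = float(np.sum(s**2) / s[0]**2)
    # Eigenvalues of W^T W are s^2; fit a power law to the largest
    # tail_frac of them with the continuous ML (Hill) estimator.
    lam = np.sort(s**2)[::-1]
    k = max(2, int(len(lam) * tail_frac))
    tail = lam[:k]
    alpha = 1.0 + k / float(np.sum(np.log(tail / tail[-1])))
    return s, stable_rank, alpha
```

Running this on each weight matrix at every checkpoint, and plotting stable rank and α against layer index over time, reproduces the kind of depth-versus-training-step picture the paper analyzes.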

Abstract

We present the first systematic study of weight-matrix singular-value spectra *during* transformer pretraining, tracking full SVD decompositions of every weight matrix at 25-step intervals across three model scales (30M–285M parameters). We discover three phenomena. **(1) Transient Compression Waves:** stable-rank compression propagates as a traveling wave from early to late layers, creating a dramatic depth gradient that peaks early and then *reverses*: late layers eventually over-compress past early layers. **(2) Persistent Spectral Gradients:** the power-law exponent α develops a permanent depth gradient forming a non-monotonic inverted-U in deeper models, with peaks shifting toward earlier layers as depth increases. **(3) Q/K–V Functional Asymmetry:** value/output projections compress uniformly while query/key projections carry the full depth-dependent dynamics. The dissociation between transient compression and persistent spectral shape reveals that *rank and spectral shape encode fundamentally different information about training*. We formalize this as a two-timescale dynamical model and derive scaling laws (Δα ∝ L^0.26, R² = 0.99). We validate on nine models across three families (custom, GPT-2, Pythia; 30M–1B parameters; 8–36 layers), demonstrate that α predicts layer importance (ρ = 0.69–0.84, p < 0.02), and show that spectral-guided pruning outperforms Last-N heuristics by 1.1×–3.6× across seven models in two families (GPT-2 124M–774M, Pythia 160M–1B), with worst-vs-best gaps up to 23.7× confirming the causal role of spectral structure.
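
The spectral-guided pruning comparison can be sketched as follows. This is our reading, not the paper's implementation: since α correlates positively with layer importance (ρ = 0.69–0.84), we assume the spectral rule prunes the layers with the lowest α first, while the Last-N baseline simply drops the final layers; both function names are hypothetical.

```python
import numpy as np

def select_prune_layers(alphas, n_prune):
    """Spectral-guided selection (assumption: lower alpha => less
    important layer, pruned first). Returns sorted layer indices."""
    order = np.argsort(alphas)           # ascending alpha
    return sorted(order[:n_prune].tolist())

def last_n_layers(num_layers, n_prune):
    """Last-N heuristic baseline: drop the final n_prune layers."""
    return list(range(num_layers - n_prune, num_layers))
```

Under this reading, the paper's 1.1×–3.6× gains come from the fact that the least-important layers (lowest α) need not be the last ones, so Last-N frequently removes layers the model still relies on.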