The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

arXiv cs.LG · April 28, 2026


Key Points

  • The study systematically tracks full SVD singular-value spectra of every transformer weight matrix at 25-step intervals during pretraining across three scales (30M–285M parameters), revealing how spectral properties evolve over training time.
  • It finds transient compression waves: stable-rank compression propagates as a traveling wave across layers, producing an early peak in the depth-wise compression gradient that later reverses as late layers over-compress past early layers.
  • It observes persistent spectral gradients: the power-law exponent α forms a depth-dependent inverted-U pattern that shifts toward earlier layers as model depth increases, and this is distinct from the transient compression behavior.
  • The authors report a functional asymmetry in Q/K–V projections, with value/output projections compressing uniformly while query/key projections carry the full depth-dependent dynamics.
  • They formalize the results with a two-timescale dynamical model, derive scaling laws (Δα ∝ L^0.26, R^2=0.99), show α correlates with layer importance (ρ=0.69–0.84), and demonstrate spectral-guided pruning beats Last-N heuristics by 1.1×–3.6× across multiple model families.
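
The two core quantities above — stable rank (the compression measure) and the power-law exponent α of the weight spectrum — can be sketched from a single SVD. This is a minimal illustration, not the paper's code: the stable-rank formula (‖W‖²_F / σ²_max) is standard, but the α estimator here is a generic maximum-likelihood (Hill-style) tail fit on the eigenvalues λᵢ = σᵢ², since the summary does not specify the authors' estimator; `tail_frac` is our hypothetical knob.

```python
import numpy as np

def spectral_stats(W, tail_frac=0.5):
    """Singular-value spectrum, stable rank, and a rough power-law
    exponent alpha for a weight matrix W (illustrative only)."""
    # Full singular-value spectrum, descending -- the object tracked
    # every 25 steps in the study.
    s = np.linalg.svd(W, compute_uv=False)
    # Stable rank: ||W||_F^2 / sigma_max^2; low values = compressed.
    stable_rank = float(np.sum(s**2) / s[0]**2)
    # Eigenvalues of W^T W are s^2; fit a power law to the largest
    # tail_frac of them with the continuous ML (Hill) estimator.
    lam = np.sort(s**2)[::-1]
    k = max(2, int(len(lam) * tail_frac))
    tail = lam[:k]
    alpha = 1.0 + k / float(np.sum(np.log(tail / tail[-1])))
    return s, stable_rank, alpha
```

Running this on each weight matrix at every checkpoint, and plotting stable rank and α against layer index over time, reproduces the kind of depth-versus-training-step picture the paper analyzes.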

Abstract

We present the first systematic study of weight-matrix singular-value spectra *during* transformer pretraining, tracking full SVD decompositions of every weight matrix at 25-step intervals across three model scales (30M–285M parameters). We discover three phenomena. **(1) Transient Compression Waves:** stable-rank compression propagates as a traveling wave from early to late layers, creating a dramatic depth gradient that peaks early and then *reverses*: late layers eventually over-compress past early layers. **(2) Persistent Spectral Gradients:** the power-law exponent α develops a permanent depth gradient forming a non-monotonic inverted-U in deeper models, with peaks shifting toward earlier layers as depth increases. **(3) Q/K–V Functional Asymmetry:** value/output projections compress uniformly while query/key projections carry the full depth-dependent dynamics. The dissociation between transient compression and persistent spectral shape reveals that *rank and spectral shape encode fundamentally different information about training*. We formalize this as a two-timescale dynamical model and derive scaling laws (Δα ∝ L^0.26, R² = 0.99). We validate on nine models across three families (custom, GPT-2, Pythia; 30M–1B parameters; 8–36 layers), demonstrate that α predicts layer importance (ρ = 0.69–0.84, p < 0.02), and show that spectral-guided pruning outperforms Last-N heuristics by 1.1×–3.6× across seven models in two families (GPT-2 124M–774M, Pythia 160M–1B), with worst-vs-best gaps up to 23.7× confirming the causal role of spectral structure.
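
The spectral-guided pruning comparison can be sketched as follows. This is our reading, not the paper's implementation: since α correlates positively with layer importance (ρ = 0.69–0.84), we assume the spectral rule prunes the layers with the lowest α first, while the Last-N baseline simply drops the final layers; both function names are hypothetical.

```python
import numpy as np

def select_prune_layers(alphas, n_prune):
    """Spectral-guided selection (assumption: lower alpha => less
    important layer, pruned first). Returns sorted layer indices."""
    order = np.argsort(alphas)           # ascending alpha
    return sorted(order[:n_prune].tolist())

def last_n_layers(num_layers, n_prune):
    """Last-N heuristic baseline: drop the final n_prune layers."""
    return list(range(num_layers - n_prune, num_layers))
```

Under this reading, the paper's 1.1×–3.6× gains come from the fact that the least-important layers (lowest α) need not be the last ones, so Last-N frequently removes layers the model still relies on.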