Compressible Softmax-Attended Language under Incompressible Attention

arXiv cs.CL / 4/7/2026


Key Points

  • The paper finds that across every attention head in five Transformer language models (124M–7B parameters, four architecture families), the attention logit energy field is captured by only a small number of singular components (2–11 to reach 90% of variance).
  • By contrast, the learned query-key interaction matrix remains comparatively high-rank, requiring 38–75 components to reach the same variance threshold for head sizes d_h = 64 or 128.
  • The authors observe large spectral gaps (roughly 5–25× in effective rank) between the learned interaction matrix and the realized logit energy field, implying the attention computation operates at a significantly reduced effective rank in practice.
  • Although the softmax attention mechanism distributes capacity uniformly across all head dimensions, real language data concentrates the meaningful interactions into only a few directions, and this “compressibility” is attributed to the data rather than the analyzing framework.
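The variance-threshold measurement described in the Key Points can be illustrated with a small numpy sketch. This is not the paper's code: the matrices, seed, and the `components_for_variance` helper are toy stand-ins, used only to show how "components needed to reach 90% of variance" is typically counted from singular values.

```python
import numpy as np

def components_for_variance(M, threshold=0.90):
    """Smallest number of singular components whose squared singular
    values account for at least `threshold` of the total variance of M."""
    s = np.linalg.svd(M, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)   # cumulative variance fraction
    return int(np.searchsorted(energy, threshold) + 1)

rng = np.random.default_rng(0)
d_h = 64

# A dense random matrix: variance spreads across many directions,
# loosely analogous to a near-full-rank learned interaction matrix.
dense = rng.standard_normal((d_h, d_h))

# A matrix dominated by a few directions plus small noise, loosely
# analogous to the concentrated logit energy field reported in the paper.
concentrated = rng.standard_normal((d_h, 3)) @ rng.standard_normal((3, d_h))
concentrated += 0.01 * rng.standard_normal((d_h, d_h))

print(components_for_variance(dense))         # many components needed
print(components_for_variance(concentrated))  # only a handful
```

Under this toy setup, the dense matrix needs a large fraction of its 64 components to reach 90% of variance, while the concentrated one needs only a few, mirroring the 38–75 vs 2–11 contrast the paper reports.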

Abstract

Across every attention head in five transformer language models (124M–7B parameters, four architecture families), the logit energy field Ẽ reaches 90% of its variance in 2–11 singular components. The *learned* interaction matrix W_Q^T W_K needs 38–75 components for the same threshold out of d_h ∈ {64, 128}. The spectral gap is 5–25× in effective rank. The attention mechanism allocates capacity uniformly across all d_h dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not the frame that analyzes it.
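The "effective rank" gap in the abstract can be sketched with one common definition, the participation ratio (sum of squared singular values, squared, over the sum of fourth powers). The paper may use a different estimator; this sketch, with hypothetical toy matrices, only shows how a 5–25×-style gap arises when one matrix's spectrum is flat and another's is concentrated.

```python
import numpy as np

def effective_rank(M):
    """Participation-ratio effective rank: (sum s_i^2)^2 / sum s_i^4.
    Equals the true rank when all nonzero singular values are equal,
    and is much smaller when a few singular values dominate."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s**2
    return float(p.sum()**2 / (p**2).sum())

rng = np.random.default_rng(1)
d_h = 64

# Flat spectrum: stand-in for the learned interaction matrix W_Q^T W_K.
learned = rng.standard_normal((d_h, d_h))

# Concentrated spectrum: stand-in for the realized logit energy field.
field = rng.standard_normal((d_h, 4)) @ rng.standard_normal((4, d_h))

gap = effective_rank(learned) / effective_rank(field)
print(effective_rank(learned), effective_rank(field), gap)
```

The ratio `gap` plays the role of the paper's spectral gap: a dense random 64×64 matrix has an effective rank of a few dozen, while the near-rank-4 matrix sits in the low single digits.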