Abstract
Across every attention head in five transformer language models (124M--7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90\% of its variance within 2--11 singular components. The \emph{learned} interaction matrix $W_Q^\mathrm{T} W_K$ needs 38--75 components to reach the same threshold, out of a head dimension $d_h \in \{64, 128\}$. The gap in effective rank is a factor of $5$--$25\times$. The attention mechanism allocates capacity uniformly across all $d_h$ dimensions, but language concentrates the actual interaction into a few of them. The compressibility of softmax-attended language is therefore a property of the data, not of the frame used to analyze it.
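For concreteness, the component counts above can be read as the following rank statistic, here denoted $k_{90}$ (the name and the squared-singular-value normalization are assumptions, not notation fixed by the text):
\[
k_{90}(M) \;=\; \min\Bigl\{\, k \;:\; \frac{\sum_{i=1}^{k} \sigma_i^2(M)}{\sum_{i=1}^{r} \sigma_i^2(M)} \,\ge\, 0.90 \Bigr\},
\]
where $\sigma_1 \ge \dots \ge \sigma_r$ are the singular values of $M$, evaluated once with $M = \tilde{E}$ and once with $M = W_Q^\mathrm{T} W_K$.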