Linear Predictability of Attention Heads in Large Language Models
arXiv cs.LG / 3/17/2026
Key Points
- The paper identifies a pervasive inter-head linear structure in pretrained Transformers, showing that QKV vectors of an attention head can be reconstructed as a linear combination of a small set of peer heads within the same layer.
- Across multiple models (Llama-3.1-8B, Falcon3-10B, OLMo-2-7B, Qwen3-32B), 2-5 reference heads recover many target heads with high fidelity, with mean R^2 around 0.76 for Keys on C4 and often >0.85 on GSM8K.
- The predictability appears to be learned during pretraining rather than built into the architecture: it is largely absent at random initialization and rises steadily across training checkpoints, consistent with a theoretical bound predicting high reconstruction error at initialization.
- The work links this emergence to increasing intra-layer alignment of Key projection subspaces.
- Practically, the authors propose caching only the reference heads' KV states and reconstructing the remaining heads on the fly, achieving roughly 2x KV-cache reduction with small accuracy trade-offs; they also find that reconstructing Keys degrades accuracy less than reconstructing Values.
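The core measurement behind these findings can be sketched with ordinary least squares: fit one head's Key states as a linear combination of a few peer heads' Key states and score the fit with R². The sketch below is a minimal illustration with simulated data, not the paper's code; the head counts, dimensions, and noise level are assumptions chosen only to mimic the reported inter-head linear structure.

```python
# Hypothetical sketch: reconstruct a target attention head's Key vectors as a
# linear combination of a few "reference" heads in the same layer, then
# measure reconstruction fidelity with R^2. All data here is simulated.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, head_dim, n_refs = 512, 64, 3  # assumed toy sizes

# Simulated Key states for the reference heads, stacked per token:
# shape (n_tokens, n_refs * head_dim).
K_refs = rng.normal(size=(n_tokens, n_refs * head_dim))

# Simulate a target head that is approximately a linear map of the
# references plus small noise, mimicking the structure the paper reports.
W_true = rng.normal(size=(n_refs * head_dim, head_dim)) / np.sqrt(n_refs * head_dim)
K_target = K_refs @ W_true + 0.1 * rng.normal(size=(n_tokens, head_dim))

# Least-squares fit: learn the linear combination from the reference heads.
W_hat, *_ = np.linalg.lstsq(K_refs, K_target, rcond=None)
K_recon = K_refs @ W_hat

# Coefficient of determination over all Key entries.
ss_res = np.sum((K_target - K_recon) ** 2)
ss_tot = np.sum((K_target - K_target.mean(axis=0)) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"R^2 = {r2:.3f}")
```

In a cache-compression setting, the same fitted map would let a serving stack store only the reference heads' KV states and regenerate the rest via one small matrix multiply per reconstructed head.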