AI Navigate

Linear Predictability of Attention Heads in Large Language Models

arXiv cs.LG / 3/17/2026


Key Points

  • The paper identifies a pervasive inter-head linear structure in pretrained Transformers, showing that QKV vectors of an attention head can be reconstructed as a linear combination of a small set of peer heads within the same layer.
  • Across multiple models (Llama-3.1-8B, Falcon3-10B, OLMo-2-7B, Qwen3-32B), 2-5 reference heads recover many target heads with high fidelity, with mean R^2 around 0.76 for Keys on C4 and often >0.85 on GSM8K.
  • The predictability appears to be learned rather than architectural: it is largely absent at random initialization and rises steadily across OLMo-2 pretraining checkpoints, consistent with a theoretical lower bound showing high linear-prediction error at initialization.
  • The work links this emergence to increasing intra-layer alignment of Key projection subspaces.
  • Practically, the authors propose caching only reference-head KV states and reconstructing the rest on the fly, achieving ~2x KV-cache reduction with model-dependent accuracy trade-offs (small on Falcon3-10B and Qwen3-32B, larger on Llama-3.1-8B), and showing that reconstructing Keys is less harmful than reconstructing Values.
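The core claim above — that a target head's Key vectors are well approximated by a linear combination of a few peer heads — can be sketched with ordinary least squares. This is a minimal illustration on synthetic data, not the paper's procedure: the shapes, the scalar-coefficient parameterization, and the R^2 computation are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, head_dim, n_refs = 512, 64, 5

# Synthetic stand-ins: per-token Key vectors for 5 "reference" heads, and a
# "target" head built (by construction) as a noisy linear mix of them.
refs = rng.normal(size=(n_refs, n_tokens, head_dim))
true_coefs = rng.normal(size=n_refs)
target = np.tensordot(true_coefs, refs, axes=1) \
         + 0.1 * rng.normal(size=(n_tokens, head_dim))

# Design matrix: one column per reference head, one row per (token, dim)
# coordinate, so a single scalar coefficient is fit per reference head.
X = refs.transpose(1, 2, 0).reshape(-1, n_refs)   # (n_tokens*head_dim, n_refs)
y = target.reshape(-1)

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coefs

# Coefficient of determination of the linear reconstruction.
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R^2 = {r2:.3f}")
```

On real models one would collect QKV activations over a corpus such as C4 and fit the maps there; the interesting empirical finding is that such fits succeed on trained checkpoints but not at random initialization.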

Abstract

Large language model (LLM) inference is increasingly bottlenecked by the Key-Value (KV) cache, yet the fine-grained structure of attention-head activations remains poorly understood. We show that pretrained Transformers exhibit a pervasive inter-head linear structure: for a given token, the Query, Key, and Value (QKV) vectors of an attention head can often be reconstructed as a linear combination of a small number of peer heads, typically within the same layer. Across Llama-3.1-8B, Falcon3-10B, OLMo-2-7B, and Qwen3-32B, just 2-5 reference heads recover many target heads with high fidelity (e.g., mean R^2 ≈ 0.76 for Keys on C4 with five references, and frequently R^2 > 0.85 on GSM8K). This predictability is learned rather than architectural: it is largely absent at random initialization, rises rapidly during pretraining as we track through OLMo-2 checkpoints, and is supported by a theoretical lower bound showing high mean-squared error for linear prediction at initialization. We further connect this emergence to increasing intra-layer alignment of Key projection subspaces. Finally, we exploit this redundancy for efficiency by caching only reference-head KV states and reconstructing the remaining heads on the fly via lightweight linear maps, achieving 2x KV-cache reduction with model-dependent accuracy trade-offs (4.5-5.5 percentage point average drop on Falcon3-10B and Qwen3-32B across five benchmarks, and larger drops on Llama-3.1-8B), and we find that reconstructing Keys is substantially less harmful than reconstructing Values.
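The efficiency idea in the abstract — cache KV states for a subset of heads and rebuild the rest at decode time via lightweight linear maps — can be sketched as follows. Everything here is a hedged illustration: the layer shape, the choice of which heads serve as references, and the (random) maps are placeholders; in practice the maps would be fit offline on calibration activations.

```python
import numpy as np

rng = np.random.default_rng(1)
n_heads, n_refs, head_dim, seq = 8, 4, 64, 16

# Hypothetical per-layer setup: cache Keys for 4 of 8 heads only
# (a 2x cache reduction) and reconstruct the other 4 on the fly.
ref_keys = rng.normal(size=(n_refs, seq, head_dim))   # the only Keys stored

# One lightweight linear map per reconstructed head, mapping the
# concatenated reference Keys to that head's Key space. Random here;
# a real system would fit these by least squares on calibration data.
maps = rng.normal(size=(n_heads - n_refs, n_refs * head_dim, head_dim))
maps /= np.sqrt(n_refs * head_dim)

# Decode-time reconstruction: concatenate the reference-head Keys per
# token, then apply each target head's map.
stacked = ref_keys.transpose(1, 0, 2).reshape(seq, n_refs * head_dim)
recon_keys = np.einsum('sd,tdh->tsh', stacked, maps)  # (4, seq, head_dim)
print(recon_keys.shape)
```

The per-token cost of reconstruction is a handful of small matrix-vector products, which is why the abstract can describe the maps as lightweight; the accuracy cost is the model-dependent drop the authors report, and it is smaller when only Keys (rather than Values) are reconstructed.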