Sensitivity-Positional Co-Localization in GQA Transformers

arXiv cs.CL / 4/10/2026


Key Points

  • The paper studies whether, in GQA transformers, the layers most sensitive to task correctness align with the layers where positional encoding (RoPE) adaptation is most impactful, proposing a “co-localization hypothesis.”
  • Experiments on Llama 3.1 8B (32-layer, 4:1 query-to-key-value head ratio) reject co-localization and instead find strong anti-localization, with task-sensitive layers concentrated late (layers 23–31) and RoPE-influential layers early (layers 0–9), yielding Spearman r_s = -0.735 (p = 1.66×10^-6).
  • The authors introduce two methods: LSLORA, which restricts LoRA adaptation only to layers selected by a new correctness-differential hidden-state metric, and GARFA, which adds learnable per-KV-head RoPE-frequency scalar multipliers to targeted layers.
  • A 4-way cross-layer ablation shows that applying both LSLORA and GARFA to the sensitivity-identified layers delivers the best results, improving performance by 4–16 percentage points across six benchmarks and approaching Claude 3.5 Haiku performance on HumanEval+ (67.1% vs. 68.3%) at roughly $100 total compute.
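The GARFA idea above (learnable per-KV-head RoPE-frequency scalar multipliers) can be made concrete with a small forward-pass sketch. This is an illustrative NumPy implementation under assumptions: the function names, shapes, and the initialization at 1.0 (i.e., standard RoPE) are ours, and the paper's exact parameterization may differ.

```python
import numpy as np

def rope_angles(head_dim, base=10000.0):
    # Standard RoPE inverse frequencies: theta_i = base^(-2i/d)
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def garfa_rotate(x, positions, freq_scale):
    """Apply RoPE to one KV head with a scalar frequency multiplier
    (the GARFA parameter for this head). x: (seq_len, head_dim)."""
    theta = freq_scale * rope_angles(x.shape[1])   # scaled frequencies
    ang = positions[:, None] * theta[None, :]      # (seq_len, head_dim/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # rotate each 2-D pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Llama 3.1 8B has 8 KV heads, so GARFA adds 8 scalars per targeted layer;
# initialized at 1.0 they reproduce standard RoPE exactly.
scales = np.ones(8)
k = np.random.randn(16, 64)                        # one head's keys
k_rot = garfa_rotate(k, np.arange(16), scales[0])
```

Because the rotation is norm-preserving, training only these 8 scalars per layer perturbs positional phase without rescaling key magnitudes, which is what makes the adapter so cheap.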

Abstract

We investigate a fundamental structural question in Grouped Query Attention (GQA) transformers: do the layers most sensitive to task correctness coincide with the layers where positional encoding adaptation has the greatest leverage? We term this the co-localization hypothesis and test it on Llama 3.1 8B, a 32-layer GQA model with a 4:1 query-to-key-value head ratio. We introduce LSLORA, which restricts LoRA adaptation to layers identified via a novel correctness-differential hidden-state metric, and GARFA (GQA-Aware RoPE Frequency Adaptation), which attaches 8 learnable per-KV-head scalar multipliers to each targeted layer. Contrary to the co-localization hypothesis, we discover strong anti-localization: task-sensitive layers concentrate in the late network (layers 23–31) while RoPE-influential layers dominate the early network (layers 0–9), yielding Spearman r_s = -0.735 (p = 1.66×10^-6). Despite this anti-localization, a 4-way cross-layer ablation shows that applying both interventions to the sensitivity-identified layers outperforms all alternative configurations by 4–16 percentage points across six diverse benchmarks (MMLU, GPQA, HumanEval+, MATH, MGSM, ARC), approaching Claude 3.5 Haiku on HumanEval+ (67.1% vs. 68.3%) at $100 total compute cost.
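The anti-localization result is a Spearman rank correlation between two per-layer score vectors. A small illustration of that computation, using synthetic 32-layer scores with the shape the abstract describes (task sensitivity growing with depth, RoPE influence decaying with depth); the values are invented, not the paper's data, and `spearman_r` is a minimal tie-free implementation:

```python
import numpy as np

def spearman_r(a, b):
    # Spearman's r_s = Pearson correlation of the rank vectors (no ties assumed)
    ranks = lambda v: np.argsort(np.argsort(v))
    return np.corrcoef(ranks(a), ranks(b))[0, 1]

# Synthetic per-layer scores for a 32-layer model: task sensitivity
# concentrated late, RoPE influence concentrated early.
rng = np.random.default_rng(0)
layers = np.arange(32)
task_sensitivity = layers + rng.normal(0, 3, 32)
rope_influence = 31 - layers + rng.normal(0, 3, 32)

r_s = spearman_r(task_sensitivity, rope_influence)
print(f"Spearman r_s = {r_s:.3f}")  # strongly negative under anti-localization
```

Under perfect anti-localization the rank orders are exact reverses and r_s = -1; the paper's observed r_s = -0.735 sits well into that negative regime.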