Sensitivity-Positional Co-Localization in GQA Transformers

arXiv cs.CL / 4/10/2026


Key Points

  • The paper studies whether, in GQA transformers, the layers most sensitive to task correctness align with the layers where positional encoding (RoPE) adaptation is most impactful, proposing a “co-localization hypothesis.”
  • Experiments on Llama 3.1 8B (32-layer, 4:1 query-to-key-value head ratio) reject co-localization and instead find strong anti-localization, with task-sensitive layers concentrated late (layers 23–31) and RoPE-influential layers early (layers 0–9), yielding Spearman r_s = -0.735 (p = 1.66×10^-6).
  • The authors introduce two methods: LSLORA, which restricts LoRA adaptation only to layers selected by a new correctness-differential hidden-state metric, and GARFA, which adds learnable per-KV-head RoPE-frequency scalar multipliers to targeted layers.
  • A 4-way cross-layer ablation shows that applying both LSLORA and GARFA to the sensitivity-identified layers delivers the best results, improving performance by 4–16 percentage points across six benchmarks and approaching Claude 3.5 Haiku performance on HumanEval+ (67.1% vs. 68.3%) at roughly $100 total compute.
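The GARFA idea above (learnable per-KV-head RoPE-frequency scalar multipliers) can be made concrete with a small forward-pass sketch. This is an illustrative NumPy implementation under assumptions: the function names, shapes, and the initialization at 1.0 (i.e., standard RoPE) are ours, and the paper's exact parameterization may differ.

```python
import numpy as np

def rope_angles(head_dim, base=10000.0):
    # Standard RoPE inverse frequencies: theta_i = base^(-2i/d)
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def garfa_rotate(x, positions, freq_scale):
    """Apply RoPE to one KV head with a scalar frequency multiplier
    (the GARFA parameter for this head). x: (seq_len, head_dim)."""
    theta = freq_scale * rope_angles(x.shape[1])   # scaled frequencies
    ang = positions[:, None] * theta[None, :]      # (seq_len, head_dim/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # rotate each 2-D pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Llama 3.1 8B has 8 KV heads, so GARFA adds 8 scalars per targeted layer;
# initialized at 1.0 they reproduce standard RoPE exactly.
scales = np.ones(8)
k = np.random.randn(16, 64)                        # one head's keys
k_rot = garfa_rotate(k, np.arange(16), scales[0])
```

Because the rotation is norm-preserving, training only these 8 scalars per layer perturbs positional phase without rescaling key magnitudes, which is what makes the adapter so cheap.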

Abstract

We investigate a fundamental structural question in Grouped Query Attention (GQA) transformers: do the layers most sensitive to task correctness coincide with the layers where positional encoding adaptation has the greatest leverage? We term this the co-localization hypothesis and test it on Llama 3.1 8B, a 32-layer GQA model with a 4:1 query-to-key-value head ratio. We introduce LSLORA, which restricts LoRA adaptation to layers identified via a novel correctness-differential hidden-state metric, and GARFA (GQA-Aware RoPE Frequency Adaptation), which attaches 8 learnable per-KV-head scalar multipliers to each targeted layer. Contrary to the co-localization hypothesis, we discover strong anti-localization: task-sensitive layers concentrate in the late network (layers 23–31) while RoPE-influential layers dominate the early network (layers 0–9), yielding Spearman r_s = -0.735 (p = 1.66×10^-6). Despite this anti-localization, a 4-way cross-layer ablation shows that applying both interventions to the sensitivity-identified layers outperforms all alternative configurations by 4–16 percentage points across six diverse benchmarks (MMLU, GPQA, HumanEval+, MATH, MGSM, ARC), approaching Claude 3.5 Haiku on HumanEval+ (67.1% vs. 68.3%) at $100 total compute cost.
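The anti-localization result is a Spearman rank correlation between two per-layer score vectors. A small illustration of that computation, using synthetic 32-layer scores with the shape the abstract describes (task sensitivity growing with depth, RoPE influence decaying with depth); the values are invented, not the paper's data, and `spearman_r` is a minimal tie-free implementation:

```python
import numpy as np

def spearman_r(a, b):
    # Spearman's r_s = Pearson correlation of the rank vectors (no ties assumed)
    ranks = lambda v: np.argsort(np.argsort(v))
    return np.corrcoef(ranks(a), ranks(b))[0, 1]

# Synthetic per-layer scores for a 32-layer model: task sensitivity
# concentrated late, RoPE influence concentrated early.
rng = np.random.default_rng(0)
layers = np.arange(32)
task_sensitivity = layers + rng.normal(0, 3, 32)
rope_influence = 31 - layers + rng.normal(0, 3, 32)

r_s = spearman_r(task_sensitivity, rope_influence)
print(f"Spearman r_s = {r_s:.3f}")  # strongly negative under anti-localization
```

Under perfect anti-localization the rank orders are exact reverses and r_s = -1; the paper's observed r_s = -0.735 sits well into that negative regime.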