Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories

arXiv cs.LG / 4/29/2026


Key Points

  • The study finds that the usual SVD-on-optimizer-update diagnostic can hide the true relationship between SED directions and Linear Centroid Hypothesis (LCH) features, and that switching to SVD on loss gradients changes the measured coupling by 1–2 orders of magnitude.
  • When SVD is applied to loss gradients rather than AdamW updates, the measured perturbative coupling between SED directions and LCH features increases from roughly 3–9× to 100–330×, and the apparent dependence on the operation type largely disappears (a minimal sketch of the two SVD variants follows this list).
  • In a multitask transformer with a shared encoder, update-based SED makes the diagnostic appear to fail (measured coupling ≤ 1×), while per-operation, gradient-based SED recovers strong coupling (roughly 20–45× across all four operations).
  • A causal intervention shows that restricting attention updates to any rank-3 subspace, whether SED-derived or random, accelerates grokking by about 2.3×, while removing the rank-3 component has little effect under the proposed gradient-projection methodology.
  • Overall, SED–LCH coupling is validated as a strong diagnostic for where feature formation concentrates in parameter space, but it is not a unique causal pathway, because AdamW attention updates are highly rank-redundant under the study's hyperparameters.
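
The sketch below illustrates the contrast the key points draw between the two diagnostics: a rolling SVD over recent loss gradients versus one over recent AdamW updates for a single weight matrix. The toy linear model, the weight name `W`, and the window length are illustrative assumptions, not the paper's actual setup; "rolling SVD" is read here as an SVD of gradients (or updates) stacked over a sliding window of training steps.

```python
# Hedged sketch: gradient-based vs. update-based rolling SVD for one weight matrix.
import torch

def top_k_directions(history, k=3):
    """Top-k right singular vectors of a window of (out_dim, in_dim) matrices."""
    stacked = torch.cat(history, dim=0)                    # (window*out_dim, in_dim)
    _, _, Vh = torch.linalg.svd(stacked, full_matrices=False)
    return Vh[:k]                                          # candidate SED directions

torch.manual_seed(0)
W = torch.nn.Parameter(torch.randn(8, 16))                 # stand-in "attention" weight
opt = torch.optim.AdamW([W], lr=1e-2)
x, y = torch.randn(32, 16), torch.randn(32, 8)

grad_hist, update_hist, window = [], [], 10
for step in range(50):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(x @ W.T, y)
    loss.backward()
    grad_hist.append(W.grad.detach().clone())              # input to gradient-based SED
    before = W.detach().clone()
    opt.step()
    update_hist.append((W.detach() - before).clone())      # input to update-based SED
    grad_hist, update_hist = grad_hist[-window:], update_hist[-window:]

grad_dirs = top_k_directions(grad_hist)                    # compare these two bases
update_dirs = top_k_directions(update_hist)                # against LCH features
print(grad_dirs.shape, update_dirs.shape)
```

In the multitask setting the same idea would apply per task: the abstract attributes the apparent failure there to gradient aggregation across competing tasks, so the window would hold gradients of one task's loss at a time rather than the aggregated gradient.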

Abstract

We show that replacing the rolling SVD of AdamW updates with a rolling SVD of loss gradients changes the diagnostic by 1--2 orders of magnitude. Performing SVD on the loss gradient instead of the AdamW update increases the measured perturbative coupling between SED directions and Linear Centroid Hypothesis (LCH) features from $\bar{R}_k \approx 3$--$9\times$ to $100$--$330\times$ across four single-task modular arithmetic operations, eliminating the apparent operation dependence in the original measurement. On a multitask transformer with a shared encoder, update-based SED gives $\bar{R}_k \leq 1$ -- an apparent failure of the diagnostic -- while per-operation gradient-based SED recovers $\bar{R}_k = 20$--$45\times$ across all four operations. Gradient aggregation across competing tasks is the main obstruction; performing SVD on per-task gradients resolves it. A causal intervention shows that constraining attention updates to any rank-3 subspace (whether SED-derived or random) accelerates grokking by approximately $2.3\times$ across random seeds and operations, while removing the rank-3 component has negligible effect under proper gradient-projection methodology. The SED--LCH coupling is therefore a strong diagnostic of where feature formation concentrates in parameter space, but it is not a unique causal pathway: the natural full-rank AdamW attention update is highly rank-redundant under our hyperparameters.
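
The abstract's rank-3 intervention can be sketched as follows: each step, the realized update to the attention weight is either restricted to a fixed rank-3 subspace (SED-derived or random) or has that rank-3 component removed. The functions `project_rows` and `constrained_step`, the random basis, and the choice to project the optimizer's realized update rather than the raw gradient are all illustrative assumptions about the "gradient-projection methodology", not the paper's stated procedure.

```python
# Hedged sketch of the rank-3 constrain/remove intervention on attention updates.
import torch

def project_rows(delta, basis, keep=True):
    """Project an (out_dim, in_dim) update onto the row space of `basis`
    (k, in_dim, orthonormal rows); keep=False removes that component instead."""
    component = delta @ basis.T @ basis
    return component if keep else delta - component

def constrained_step(W, optimizer, basis, keep=True):
    """Take one optimizer step, then replace the realized update on W with its
    projected version. Assumes W is the only parameter the optimizer updates here."""
    before = W.detach().clone()
    optimizer.step()
    with torch.no_grad():
        raw_update = W.detach() - before
        W.copy_(before + project_rows(raw_update, basis, keep=keep))

# Example: a random rank-3 basis (orthonormal rows) for a 16-dimensional input space.
basis = torch.linalg.qr(torch.randn(16, 3))[0].T           # (3, 16)

# Tiny demo on a stand-in weight, mirroring the single-task sketch above.
W = torch.nn.Parameter(torch.randn(8, 16))
opt = torch.optim.AdamW([W], lr=1e-2)
torch.nn.functional.mse_loss(torch.randn(4, 16) @ W.T, torch.randn(4, 8)).backward()
constrained_step(W, opt, basis, keep=True)  # keep=True restricts to rank 3; keep=False removes it
```

Under the abstract's findings, the `keep=True` variant (with either an SED-derived or a random basis) is the one associated with roughly 2.3× faster grokking, while `keep=False` has negligible effect.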