Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories

arXiv cs.LG / 4/29/2026


Key Points

  • The study finds that the usual SVD-on-optimizer-update diagnostic can hide the true relationship between SED directions and Linear Centroid Hypothesis (LCH) features, and that switching to SVD on loss gradients changes the measured coupling by 1–2 orders of magnitude.
  • When SVD is applied to loss gradients rather than AdamW updates, the measured perturbative coupling between SED directions and LCH features increases from roughly 3–9× to 100–330×, and the apparent dependence on the operation type largely disappears (a minimal sketch of the two SVD variants follows this list).
  • In a multitask transformer with a shared encoder, update-based SED makes the diagnostic appear to fail (measured coupling ≤ 1×), while per-operation, gradient-based SED recovers strong coupling (roughly 20–45× across all four operations).
  • A causal intervention shows that restricting attention updates to any rank-3 subspace, whether SED-derived or random, accelerates grokking by about 2.3×, while removing the rank-3 component has little effect under the proposed gradient-projection methodology.
  • Overall, SED–LCH coupling is validated as a strong diagnostic for where feature formation concentrates in parameter space, but it is not a unique causal pathway, because AdamW attention updates are highly rank-redundant under the study's hyperparameters.
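
The sketch below illustrates the contrast the key points draw between the two diagnostics: a rolling SVD over recent loss gradients versus one over recent AdamW updates for a single weight matrix. The toy linear model, the weight name `W`, and the window length are illustrative assumptions, not the paper's actual setup; "rolling SVD" is read here as an SVD of gradients (or updates) stacked over a sliding window of training steps.

```python
# Hedged sketch: gradient-based vs. update-based rolling SVD for one weight matrix.
import torch

def top_k_directions(history, k=3):
    """Top-k right singular vectors of a window of (out_dim, in_dim) matrices."""
    stacked = torch.cat(history, dim=0)                    # (window*out_dim, in_dim)
    _, _, Vh = torch.linalg.svd(stacked, full_matrices=False)
    return Vh[:k]                                          # candidate SED directions

torch.manual_seed(0)
W = torch.nn.Parameter(torch.randn(8, 16))                 # stand-in "attention" weight
opt = torch.optim.AdamW([W], lr=1e-2)
x, y = torch.randn(32, 16), torch.randn(32, 8)

grad_hist, update_hist, window = [], [], 10
for step in range(50):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(x @ W.T, y)
    loss.backward()
    grad_hist.append(W.grad.detach().clone())              # input to gradient-based SED
    before = W.detach().clone()
    opt.step()
    update_hist.append((W.detach() - before).clone())      # input to update-based SED
    grad_hist, update_hist = grad_hist[-window:], update_hist[-window:]

grad_dirs = top_k_directions(grad_hist)                    # compare these two bases
update_dirs = top_k_directions(update_hist)                # against LCH features
print(grad_dirs.shape, update_dirs.shape)
```

In the multitask setting the same idea would apply per task: the abstract attributes the apparent failure there to gradient aggregation across competing tasks, so the window would hold gradients of one task's loss at a time rather than the aggregated gradient.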

Abstract

We show that replacing the rolling SVD of AdamW updates with a rolling SVD of loss gradients changes the diagnostic by 1--2 orders of magnitude. Performing SVD on the loss gradient instead of the AdamW update increases the measured perturbative coupling between SED directions and Linear Centroid Hypothesis (LCH) features from $\bar{R}_k \approx 3$--$9\times$ to $100$--$330\times$ across four single-task modular arithmetic operations, eliminating the apparent operation dependence in the original measurement. On a multitask transformer with a shared encoder, update-based SED gives $\bar{R}_k \leq 1$ -- an apparent failure of the diagnostic -- while per-operation gradient-based SED recovers $\bar{R}_k = 20$--$45\times$ across all four operations. Gradient aggregation across competing tasks is the main obstruction; performing SVD on per-task gradients resolves it. A causal intervention shows that constraining attention updates to any rank-3 subspace (whether SED-derived or random) accelerates grokking by approximately $2.3\times$ across random seeds and operations, while removing the rank-3 component has negligible effect under proper gradient-projection methodology. The SED--LCH coupling is therefore a strong diagnostic of where feature formation concentrates in parameter space, but it is not a unique causal pathway: the natural full-rank AdamW attention update is highly rank-redundant under our hyperparameters.
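
The abstract's rank-3 intervention can be sketched as follows: each step, the realized update to the attention weight is either restricted to a fixed rank-3 subspace (SED-derived or random) or has that rank-3 component removed. The functions `project_rows` and `constrained_step`, the random basis, and the choice to project the optimizer's realized update rather than the raw gradient are all illustrative assumptions about the "gradient-projection methodology", not the paper's stated procedure.

```python
# Hedged sketch of the rank-3 constrain/remove intervention on attention updates.
import torch

def project_rows(delta, basis, keep=True):
    """Project an (out_dim, in_dim) update onto the row space of `basis`
    (k, in_dim, orthonormal rows); keep=False removes that component instead."""
    component = delta @ basis.T @ basis
    return component if keep else delta - component

def constrained_step(W, optimizer, basis, keep=True):
    """Take one optimizer step, then replace the realized update on W with its
    projected version. Assumes W is the only parameter the optimizer updates here."""
    before = W.detach().clone()
    optimizer.step()
    with torch.no_grad():
        raw_update = W.detach() - before
        W.copy_(before + project_rows(raw_update, basis, keep=keep))

# Example: a random rank-3 basis (orthonormal rows) for a 16-dimensional input space.
basis = torch.linalg.qr(torch.randn(16, 3))[0].T           # (3, 16)

# Tiny demo on a stand-in weight, mirroring the single-task sketch above.
W = torch.nn.Parameter(torch.randn(8, 16))
opt = torch.optim.AdamW([W], lr=1e-2)
torch.nn.functional.mse_loss(torch.randn(4, 16) @ W.T, torch.randn(4, 8)).backward()
constrained_step(W, opt, basis, keep=True)  # keep=True restricts to rank 3; keep=False removes it
```

Under the abstract's findings, the `keep=True` variant (with either an SED-derived or a random basis) is the one associated with roughly 2.3× faster grokking, while `keep=False` has negligible effect.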