AI Navigate

Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients

arXiv cs.AI / 3/17/2026


Key Points

  • Gradient Atoms is an unsupervised method that decomposes per-document training gradients into sparse components ("atoms") using dictionary learning in a preconditioned eigenspace.
  • Among 500 discovered atoms, the highest-coherence ones recover interpretable task-type behaviors (refusal, arithmetic, yes/no classification, trivia QA) without any behavioral labels.
  • These atoms also function as steering vectors: applying them as weight-space perturbations yields large, controllable shifts in model behavior (e.g., bulleted-list generation rising from 33% to 94%, systematic refusal dropping from 50% to 0%).
  • The method requires no query-document scoring stage, scales independently of the number of query behaviors, and code is available at https://github.com/jrosseruk/gradient_atoms.
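The pipeline the key points describe can be sketched end to end on toy data: flatten per-document gradients, precondition them by projecting into a whitened top-k eigenspace, then run sparse dictionary learning so each gradient is a sparse combination of shared atoms. Everything below is illustrative; the dimensions, the whitening-style preconditioner, and the ISTA-style sparse coding loop are assumptions of this sketch, not the paper's actual implementation (see the linked repository for that).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for per-document training gradients: n_docs documents,
# each a flattened d-dimensional gradient vector.
n_docs, d, k = 200, 64, 8            # k = eigenspace dimension (assumption)
G = rng.standard_normal((n_docs, d))

# --- Preconditioning: project gradients onto the top-k eigenvectors of the
# --- (uncentred) gradient covariance, whitening each direction.
cov = G.T @ G / n_docs
eigvals, eigvecs = np.linalg.eigh(cov)
top = np.argsort(eigvals)[::-1][:k]
P = eigvecs[:, top] / np.sqrt(eigvals[top])   # (d, k) precondition + project
Z = G @ P                                     # (n_docs, k) preconditioned gradients

# --- Sparse dictionary learning by alternating minimisation:
# --- find atoms D (n_atoms, k) and sparse codes A (n_docs, n_atoms)
# --- such that Z ~= A @ D with A mostly zero.
n_atoms, lam, steps = 16, 0.1, 50
D = rng.standard_normal((n_atoms, k))
D /= np.linalg.norm(D, axis=1, keepdims=True)
A = np.zeros((n_docs, n_atoms))
for _ in range(steps):
    # Sparse-coding step: one ISTA update (gradient step + soft threshold).
    step = 1.0 / np.linalg.norm(D @ D.T, 2)   # 1 / Lipschitz constant
    A = A - step * (A @ D - Z) @ D.T
    A = np.sign(A) * np.maximum(np.abs(A) - step * lam, 0.0)
    # Dictionary step: least squares for D, then renormalise each atom.
    D = np.linalg.lstsq(A, Z, rcond=None)[0]
    D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-12

sparsity = float(np.mean(A == 0.0))
print(f"code sparsity: {sparsity:.2f}")
```

On real models the gradient dimension is the full parameter count, so the eigendecomposition would be replaced by a randomized or iterative method; the alternating structure of the sparse decomposition stays the same.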

Abstract

Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. We argue that this per-document framing is fundamentally mismatched to how fine-tuning actually works: models often learn broad concepts shared across many examples. Existing TDA methods are supervised: they require a query behavior, then score every training document against it, making them both expensive and unable to surface behaviors the user did not think to ask about. We present Gradient Atoms, an unsupervised method that decomposes per-document training gradients into sparse components ("atoms") via dictionary learning in a preconditioned eigenspace. Among the 500 discovered atoms, the highest-coherence ones recover interpretable task-type behaviors (refusal, arithmetic, yes/no classification, trivia QA) without any behavioral labels. These atoms double as effective steering vectors: applying them as weight-space perturbations produces large, controllable shifts in model behavior (e.g., bulleted-list generation from 33% to 94%; systematic refusal from 50% to 0%). The method requires no query-document scoring stage, and scales independently of the number of query behaviors of interest. Code: https://github.com/jrosseruk/gradient_atoms
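Steering with a discovered atom then amounts to lifting it from the eigenspace back into parameter space and adding it to the model's flattened weights. A minimal sketch, where the orthonormal projection `P`, the atom, the weight vector, and the strength `alpha` are all hypothetical stand-ins rather than values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 64, 8

# Stand-in for the eigenspace projection from the preconditioning step
# (orthonormal columns), and for one learned unit-norm atom in that space.
P = np.linalg.qr(rng.standard_normal((d, k)))[0]   # (d, k)
atom = rng.standard_normal(k)
atom /= np.linalg.norm(atom)

w = rng.standard_normal(d)   # stand-in for the model's flattened weights

def steer(w: np.ndarray, atom: np.ndarray, P: np.ndarray, alpha: float) -> np.ndarray:
    """Apply an atom as a weight-space perturbation: lift it back to
    parameter space through the projection and add it at strength alpha."""
    delta = P @ atom             # (d,) steering direction in weight space
    return w + alpha * delta

w_steered = steer(w, atom, P, alpha=2.0)
```

The sign and magnitude of `alpha` control the direction and strength of the shift, which is how a single atom can both amplify a behavior (bulleted lists) and suppress one (refusal).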