Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients

arXiv cs.AI / 3/17/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

共有:

Key Points

Gradient Atoms is an unsupervised method that decomposes per-document training gradients into sparse components ("atoms") using dictionary learning in a preconditioned eigenspace.
Among 500 discovered atoms, the highest-coherence ones recover interpretable task-type behaviors (refusal, arithmetic, yes/no classification, trivia QA) without any behavioral labels.
These atoms also function as steering vectors: applying them as weight-space perturbations yields large, controllable shifts in model behavior (e.g., bulleted-list generation rising from 33% to 94%, systematic refusal dropping from 50% to 0%).
The method requires no query-document scoring stage, scales independently of the number of query behaviors, and code is available at https://github.com/jrosseruk/gradient_atoms.

Abstract

Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. We argue that this per-document framing is fundamentally mismatched to how fine-tuning actually works: models often learn broad concepts shared across many examples. Existing TDA methods are supervised -- they require a query behavior, then score every training document against it -- making them both expensive and unable to surface behaviors the user did not think to ask about. We present Gradient Atoms, an unsupervised method that decomposes per-document training gradients into sparse components ("atoms") via dictionary learning in a preconditioned eigenspace. Among the 500 discovered atoms, the highest-coherence ones recover interpretable task-type behaviors -- refusal, arithmetic, yes/no classification, trivia QA -- without any behavioral labels. These atoms double as effective steering vectors: applying them as weight-space perturbations produces large, controllable shifts in model behavior (e.g., bulleted-list generation 33% to 94%; systematic refusal 50% to 0%). The method requires no query--document scoring stage, and scales independently of the number of query behaviors of interest. Code is here: https://github.com/jrosseruk/gradient_atoms

Automating the Chase: AI for Festival Vendor Compliance

Dev.to

MCP Skills vs MCP Tools: The Right Way to Configure Your Server

Dev.to

500 AI Prompts Every Content Creator Needs in 2026 (20 Free Samples)

Dev.to

Building a Game for My Daughter with AI — Part 1: What If She Could Build It Too?

Dev.to

Math needs thinking time, everyday knowledge needs memory, and a new Transformer architecture aims to deliver both

THE DECODER

Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients

Key Points

Abstract

Related Articles

Automating the Chase: AI for Festival Vendor Compliance

MCP Skills vs MCP Tools: The Right Way to Configure Your Server

500 AI Prompts Every Content Creator Needs in 2026 (20 Free Samples)

Building a Game for My Daughter with AI — Part 1: What If She Could Build It Too?

Math needs thinking time, everyday knowledge needs memory, and a new Transformer architecture aims to deliver both

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer