ADAG: Automatically Describing Attribution Graphs

arXiv cs.CL / 4/10/2026


Key Points

  • The paper introduces ADAG, an end-to-end, fully automated pipeline to describe attribution graphs used in language model interpretability and circuit tracing without relying on manual inspection of artifacts.
  • ADAG uses “attribution profiles” that quantify a feature’s functional role through input and output gradient effects, providing a more systematic measurement of feature contributions.
  • It proposes a new clustering algorithm to group related features into functional components, aiming to recover coherent subcircuits.
  • An LLM-based explainer–simulator setup then generates and scores natural-language explanations for the roles of these feature groups.
  • The authors report that ADAG recovers interpretable circuits on existing human-analyzed benchmarks and can identify steerable clusters tied to a harmful advice jailbreak in Llama 3.1 8B Instruct.
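The summary does not spell out how attribution profiles are built or how the clustering works, but the described idea — characterize each feature by the pattern of its incoming and outgoing attribution effects, then group features with similar patterns — can be sketched as follows. The profile construction and the greedy cosine-similarity clustering below are assumptions for illustration, not ADAG's actual (novel, unspecified) algorithm:

```python
import numpy as np

def attribution_profiles(adj: np.ndarray) -> np.ndarray:
    """Build one profile per feature from a weighted attribution graph.

    adj[i, j] = attribution edge weight from feature i to feature j.
    Each profile concatenates a feature's input effects (its column)
    and output effects (its row), L2-normalized so that clustering
    compares the *pattern* of effects rather than their magnitude.
    """
    profiles = np.concatenate([adj.T, adj], axis=1)  # [in-effects | out-effects]
    norms = np.linalg.norm(profiles, axis=1, keepdims=True)
    return profiles / np.clip(norms, 1e-8, None)

def greedy_cosine_clusters(profiles: np.ndarray, threshold: float = 0.8):
    """Simple stand-in for ADAG's clustering step: assign each feature
    to the first cluster whose centroid it matches above the cosine
    threshold, otherwise start a new cluster."""
    clusters: list[list[int]] = []
    centroids: list[np.ndarray] = []
    for i, p in enumerate(profiles):
        sims = [float(p @ c) for c in centroids]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            clusters[k].append(i)
            c = profiles[clusters[k]].mean(axis=0)  # refresh centroid
            centroids[k] = c / np.linalg.norm(c)
        else:
            clusters.append([i])
            centroids.append(p)
    return clusters
```

On a toy graph where two features share identical input/output effect patterns and a third differs, the two matching features land in one cluster and the third in its own.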

Abstract

In language model interpretability research, "circuit tracing" aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce ADAG, a fully automated end-to-end pipeline for describing these attribution graphs. To achieve this, we introduce "attribution profiles", which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer–simulator setup which generates and scores natural-language explanations of the functional role of these feature groups. We run our system on known human-analysed circuit-tracing tasks and recover interpretable circuits, and further show that ADAG can find steerable clusters which are responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct.
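The explainer–simulator setup presumably follows the usual automated-interpretability recipe: an explainer LLM proposes a natural-language description of a feature group, a simulator LLM prompted only with that description predicts the group's activations on held-out inputs, and the explanation is scored by how well the predictions match reality. The abstract does not state ADAG's exact metric; Pearson correlation, used below, is a common assumed stand-in:

```python
import numpy as np

def explanation_score(true_acts: np.ndarray, simulated_acts: np.ndarray) -> float:
    """Score an explanation by the Pearson correlation between the
    cluster's true activations and the activations a simulator LLM
    predicted from the explanation alone (assumed metric, not
    necessarily ADAG's)."""
    t = true_acts - true_acts.mean()
    s = simulated_acts - simulated_acts.mean()
    denom = np.linalg.norm(t) * np.linalg.norm(s)
    if denom == 0.0:
        return 0.0  # constant signal: no correlation defined
    return float(t @ s / denom)
```

Given several candidate explanations, the pipeline would keep the one whose simulated activations score highest; a perfect simulator yields a score of 1.0 and an anti-correlated one -1.0.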