ADAG: Automatically Describing Attribution Graphs
arXiv cs.CL / 4/10/2026
Key Points
- The paper introduces ADAG, an end-to-end, fully automated pipeline for describing attribution graphs, the artifacts used in language model interpretability and circuit tracing, so that they no longer have to be inspected manually.
- ADAG uses “attribution profiles,” which quantify a feature’s functional role through its input and output gradient effects, giving a more systematic measure of feature contributions.
- It proposes a new clustering algorithm to group related features into functional components, aiming to recover coherent subcircuits.
- An LLM-based explainer–simulator setup then generates and scores natural-language explanations for the roles of these feature groups.
- The authors report that ADAG recovers interpretable circuits on existing human-analyzed benchmarks and can identify steerable clusters tied to a harmful-advice jailbreak in Llama 3.1 8B Instruct.
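
To make the profile-then-cluster idea in the points above concrete, here is a minimal Python sketch. It is not the paper's method: the summary does not specify how attribution profiles are computed or how the new clustering algorithm works, so this sketch assumes a profile is the concatenation of unit-normalized input-side and output-side attribution vectors, and it swaps in a simple greedy cosine-similarity clustering. The function names (`attribution_profile`, `greedy_cluster`), the `threshold` parameter, and the toy numbers are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def attribution_profile(input_attr, output_attr):
    """Assumed profile: normalized input-side effects followed by output-side effects."""
    return normalize(input_attr) + normalize(output_attr)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def greedy_cluster(profiles, threshold=0.8):
    """Stand-in clustering (NOT the paper's algorithm): assign each feature to the
    most similar existing centroid if similarity >= threshold, else start a new cluster."""
    clusters, centroids = [], []
    for name, prof in profiles.items():
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(prof, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            clusters.append([name])
            centroids.append(list(prof))
        else:
            n = len(clusters[best])
            # Update the running mean of the cluster's profiles.
            centroids[best] = [(c * n + p) / (n + 1) for c, p in zip(centroids[best], prof)]
            clusters[best].append(name)
    return clusters


# Toy attribution vectors (made up for illustration): f1 and f2 read from and
# write to similar places, while f3 does something different.
profiles = {
    "f1": attribution_profile([1.0, 0.0, 0.0], [0.9, 0.1]),
    "f2": attribution_profile([0.9, 0.1, 0.0], [1.0, 0.0]),
    "f3": attribution_profile([0.0, 0.0, 1.0], [0.0, 1.0]),
}
clusters = greedy_cluster(profiles)
print(clusters)
```

In this toy setup, features with near-identical input and output effects end up in the same group, which is the kind of functional component an explainer model could then be asked to describe. The explainer–simulator scoring step is not sketched here because the summary gives no detail on how explanations are scored.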



