Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations
arXiv cs.AI / 3/12/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The authors propose a pipeline that links circuit-level analysis to natural-language explanations: it identifies causally important attention heads via activation patching (see the sketch after this list), generates explanations with both template-based and LLM-based methods, and evaluates their faithfulness with ERASER-style metrics adapted for circuit attribution.
- They evaluate on the Indirect Object Identification (IOI) task in GPT-2 Small, identifying six attention heads that account for 61.4% of the logit difference.
- Circuit-based explanations achieve 100% sufficiency but only 22% comprehensiveness (see the second sketch below), revealing backup mechanisms distributed across the model's remaining heads.
- LLM-generated explanations outperform template baselines by 64% on quality metrics.
- They report no correlation between model confidence and explanation faithfulness, and they identify three failure modes in which explanations diverge from the underlying mechanism.
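
Below is a minimal sketch of how head-level activation patching can score attention heads on the IOI task in GPT-2 Small. It assumes a TransformerLens-style `HookedTransformer`; the prompts, the single-example setup, and the choice of library are illustrative assumptions, not the authors' exact code.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small
torch.set_grad_enabled(False)

# Clean prompt: " Mary" is the indirect object; the corrupted prompt repeats the other name.
clean_prompt = "When Mary and John went to the store, John gave a drink to"
corrupt_prompt = "When Mary and John went to the store, Mary gave a drink to"
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")

def logit_diff(logits):
    # Logit difference at the final position: correct indirect object minus subject.
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)
clean_logits, clean_cache = model.run_with_cache(clean_tokens)
clean_ld = logit_diff(clean_logits)
corrupt_ld = logit_diff(model(corrupt_tokens))

def patch_head(layer, head):
    """Run the corrupted prompt while splicing in one head's clean output (hook_z)."""
    name = utils.get_act_name("z", layer)

    def hook(z, hook):
        z[:, :, head, :] = clean_cache[name][:, :, head, :]
        return z

    patched = model.run_with_hooks(corrupt_tokens, fwd_hooks=[(name, hook)])
    # Fraction of the clean-vs-corrupt logit-difference gap this single head restores.
    return (logit_diff(patched) - corrupt_ld) / (clean_ld - corrupt_ld)

scores = {(l, h): patch_head(l, h)
          for l in range(model.cfg.n_layers)
          for h in range(model.cfg.n_heads)}
print(sorted(scores.items(), key=lambda kv: -kv[1])[:6])  # top candidate heads
```

A sweep like this is the standard way to rank heads by causal effect; the paper's prompt distribution and aggregation across examples will differ, but the six heads it reports account for 61.4% of the logit difference under its protocol.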
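One plausible adaptation of ERASER's sufficiency and comprehensiveness to circuit attribution is sketched next, reusing `model`, `clean_tokens`, `clean_cache`, `clean_ld`, and `logit_diff` from the sketch above. The six (layer, head) pairs, the mean-ablation choice, and the exact metric definitions are assumptions for illustration and may not match the paper's protocol.

```python
from transformer_lens import utils

# Hypothetical circuit: stand-ins for the six heads a patching sweep might surface.
CIRCUIT = [(7, 3), (7, 9), (8, 6), (9, 6), (9, 9), (10, 0)]

# Per-layer mean head outputs over positions of the clean run (single example, for brevity).
means = {layer: clean_cache[utils.get_act_name("z", layer)].mean(dim=1, keepdim=True)
         for layer in range(model.cfg.n_layers)}

def ablate_heads(heads):
    """Mean-ablate the given heads on the clean prompt and return the logit difference."""
    layers = {l for l, _ in heads}

    def make_hook(layer):
        def hook(z, hook):
            for l, h in heads:
                if l == layer:
                    z[:, :, h, :] = means[layer][:, :, h, :]
            return z
        return hook

    fwd_hooks = [(utils.get_act_name("z", layer), make_hook(layer)) for layer in layers]
    return logit_diff(model.run_with_hooks(clean_tokens, fwd_hooks=fwd_hooks))

all_heads = [(l, h) for l in range(model.cfg.n_layers) for h in range(model.cfg.n_heads)]
complement = [hd for hd in all_heads if hd not in CIRCUIT]

# Sufficiency: keep only the circuit (ablate everything else); how much behavior survives?
sufficiency = ablate_heads(complement) / clean_ld
# Comprehensiveness: ablate the circuit itself; how much behavior is destroyed?
comprehensiveness = 1 - ablate_heads(CIRCUIT) / clean_ld
print(f"sufficiency={sufficiency:.2f}  comprehensiveness={comprehensiveness:.2f}")
```

Under definitions like these, the paper's 100% sufficiency with 22% comprehensiveness would correspond to a circuit that preserves the behavior on its own, while ablating it removes only a fraction of the behavior because backup heads partially compensate.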