Automated Interpretability and Feature Discovery in Language Models with Agents
arXiv cs.CL / 5/5/2026
Key Points
- The paper proposes an autonomous multi-agent framework for mechanistic interpretability that both generates explanations and discovers internal features in large language models.
- It uses two coupled feedback loops: one refines explanation hypotheses via targeted prompt controls and multi-metric evaluation, and the other discovers features by building an activation-space k-nearest-neighbor graph and filtering candidates by statistical separability and semantic coherence (see the sketch after this list).
- In experiments on the Gemma-2 model family and on MLP neurons in weight-sparse transformer variants, the framework outperforms one-shot automated interpretability methods.
- The approach aims to produce auditable, falsifiable explanation traces and can uncover language-specific and safety-relevant internal features.
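The feature-discovery loop in the second bullet lends itself to a compact illustration. The following is a minimal sketch, assuming activation vectors are already collected per unit and a labeled probe set is available for the concept of interest; the function names, the cosine metric, and the AUC threshold are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of k-NN-graph feature discovery with a
# statistical-separability filter. Not the paper's code.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import roc_auc_score

def knn_graph(activations: np.ndarray, k: int = 10) -> np.ndarray:
    """Index matrix of each unit's k nearest neighbors in
    activation space (cosine distance); row i lists neighbors of unit i."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(activations)
    _, idx = nn.kneighbors(activations)
    return idx[:, 1:]  # drop each unit's self-match

def separability(unit_acts: np.ndarray, labels: np.ndarray) -> float:
    """Separability proxy: AUC of a unit's activations against a
    binary concept label (higher means cleaner separation)."""
    return roc_auc_score(labels, unit_acts)

def filter_candidates(activations: np.ndarray,
                      probe_acts: np.ndarray,
                      labels: np.ndarray,
                      k: int = 10,
                      auc_min: float = 0.9) -> list[tuple[int, list[int]]]:
    """Keep units whose activations separate the probe concept,
    together with their k-NN neighborhood for further exploration."""
    graph = knn_graph(activations, k)
    kept = []
    for unit, neighbors in enumerate(graph):
        if separability(probe_acts[:, unit], labels) >= auc_min:
            kept.append((unit, neighbors.tolist()))
    return kept
```

In the paper's framework, surviving candidates would additionally be screened for semantic coherence (e.g., whether explanations generated for a unit and its neighbors agree); that step depends on the agent's explanation model and is omitted from this sketch.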