Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
arXiv cs.AI / April 21, 2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that existing LLM interpretability research often analyzes failures only on short prompts or toy setups, leaving a gap for realistic, commonly used benchmarks.
- It proposes "contrastive attribution," an LRP-based method that attributes the logit difference between an incorrect token and a correct alternative to input tokens and internal model states.
- The authors introduce an efficient extension to build cross-layer attribution graphs, enabling analysis for long-context inputs.
- They run a systematic empirical study across multiple benchmarks, comparing how attribution patterns vary by dataset, model size, and training checkpoint.
- The findings indicate that token-level contrastive attribution yields useful signals for some failure cases but is not reliably informative across all scenarios, demonstrating both the method's value and its limits.
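To make the core idea concrete: for a purely linear scorer, attributing the logit difference between a wrong and a correct token reduces to projecting each input token onto the difference of the two unembedding columns, and LRP on a linear layer yields the same per-token relevances. The following is a minimal toy sketch under that assumption, not the paper's implementation; all names (`E`, `W`, `contrastive_attribution`) are illustrative.

```python
import numpy as np

# Toy sketch of contrastive attribution (illustrative, not the paper's code).
# For a linear scorer logits = pooled_input @ W, the contrastive target is
#   d = logit[wrong] - logit[correct],
# and each token's relevance is its projection onto the contrastive direction
#   R_i = e_i . (W[:, wrong] - W[:, correct]).

rng = np.random.default_rng(0)
n_tokens, d_model, vocab = 4, 8, 10
E = rng.normal(size=(n_tokens, d_model))  # toy per-token embeddings
W = rng.normal(size=(d_model, vocab))     # toy unembedding matrix

def contrastive_attribution(E, W, wrong, correct):
    """One relevance score per input token for logit[wrong] - logit[correct]."""
    w_diff = W[:, wrong] - W[:, correct]  # contrastive direction in model space
    return E @ w_diff

rel = contrastive_attribution(E, W, wrong=3, correct=7)

# Conservation check: token relevances decompose the pooled logit difference.
logits = E.sum(axis=0) @ W
assert np.isclose(rel.sum(), logits[3] - logits[7])
```

In the linear case the relevances sum exactly to the contrastive logit difference, which is the conservation property LRP is designed to preserve; the paper's contribution is propagating such a contrastive signal through a full transformer and across layers, which this toy example does not attempt.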
Related Articles

To what extent could AI replace us in our jobs? Sometimes I think people exaggerate a bit.
Reddit r/artificial

Magnificent irony as Meta staff unhappy about running surveillance software on work PCs
The Register

ETHENEA (ETHENEA Americas LLC) Analyst View: Asset Allocation Resilience in the 2026 Global Macro Cycle
Dev.to

DEEPX and Hyundai Are Building Generative AI Robots
Dev.to

Stop Paying OpenAI to Read Garbage: The Two-Stage Agent Pipeline
Dev.to