Whether, Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLMs
arXiv cs.AI / 3/25/2026
Key Points
- The paper tests prior “emotion circuit” claims in multiple LLMs using clinical-style vignettes that evoke emotion via contextual cues, with emotion keywords removed.
- Using four mechanistic interpretability approaches (linear probing, causal activation patching, knockout experiments, and representational geometry), the authors find two dissociable mechanisms: affect reception and emotion categorization.
- Affect reception remains near-perfect without keywords (AUROC 1.000), suggesting early-layer saturation and keyword-independent detection of emotionally significant content.
- Emotion categorization is partially keyword-dependent, dropping by 1–7% without keywords and improving with model scale, indicating that mapping to specific labels is not fully decoupled from explicit word signals.
- The results falsify a keyword-spotting hypothesis, and the authors propose clinical vignette methodology as a more rigorous standard for evaluating emotion-related representations, with implications for AI safety and alignment.
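
To make the linear-probing and AUROC terms in the points above concrete, here is a minimal sketch in plain Python. It is not the authors' code: the "activations" are synthetic Gaussian vectors, the probe is a simple difference-of-means direction swapped in for whatever probe the paper trains, and every name (`make_synthetic_activations`, `fit_mean_difference_probe`, the dimension and shift values) is a hypothetical chosen for illustration.

```python
import random


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def fit_mean_difference_probe(X, y):
    """Return a probe direction: mean of class 1 minus mean of class 0."""
    dim = len(X[0])
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    mu_pos = [sum(x[i] for x in pos) / len(pos) for i in range(dim)]
    mu_neg = [sum(x[i] for x in neg) / len(neg) for i in range(dim)]
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def probe_scores(X, w):
    """Project each activation vector onto the probe direction."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in X]


def make_synthetic_activations(n, dim, shift, rng):
    """Hypothetical stand-in for hidden states.

    Assumption: 'affective' examples (label 1) are shifted along axis 0.
    Real experiments would read activations from a model layer instead.
    """
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift * label
        X.append(x)
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic_activations(200, 16, 8.0, rng)
X_test, y_test = make_synthetic_activations(200, 16, 8.0, rng)

w = fit_mean_difference_probe(X_train, y_train)
test_auroc = auroc(probe_scores(X_test, w), y_test)
print(f"held-out AUROC: {test_auroc:.3f}")
```

In a real replication, the synthetic vectors would be replaced by activations captured at a given layer for keyword-free vignettes, and the probe would be fit per layer to see where the held-out AUROC saturates.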