CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models
arXiv cs.CL / 4/14/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that current LLM ToM performance often depends on prompt scaffolding and may not generalize to complex, task-specific scenarios, suggesting a mismatch between internal knowledge and external behavior.
- It introduces CoSToM (Causal-oriented Steering for ToM alignment), which combines causal tracing to identify how ToM semantics are represented inside the model with targeted activation steering to intervene directly in ToM-critical layers.
- By mapping how ToM-relevant features are distributed across layers via causal tracing, the method aims to move beyond purely mechanistic interpretation toward active, behavior-stabilizing alignment.
- Experiments reported in the paper indicate that CoSToM improves human-like social reasoning and enhances downstream dialogue quality.
- Overall, the work proposes an approach for “intrinsic cognition” alignment by stabilizing externally observable ToM-like behavior through causal, internal interventions.
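The key points above pair causal tracing (locating where ToM semantics live) with activation steering (adding a learned direction to hidden states at those layers). As a purely illustrative sketch of the steering half, here is a toy two-layer network standing in for an LLM's residual stream; every name, weight, and the steering vector are invented for illustration and are not taken from the paper (in CoSToM, the vector would presumably come from the causal-tracing step):

```python
# Toy sketch of activation steering (illustrative only, not CoSToM's actual code).
# A "steering vector" is added to the hidden state after one designated
# (here, hypothetically "ToM-critical") layer during the forward pass.

def layer(x, weight, bias):
    # One toy layer: elementwise affine transform + ReLU.
    return [max(0.0, w * xi + b) for xi, w, b in zip(x, weight, bias)]

def forward(x, params, steer_at=None, vector=None, alpha=1.0):
    """Run layers in sequence; if steer_at is set, add alpha * vector
    to the hidden state right after that layer's output."""
    h = x
    for i, (w, b) in enumerate(params):
        h = layer(h, w, b)
        if steer_at == i and vector is not None:
            h = [hi + alpha * vi for hi, vi in zip(h, vector)]
    return h

# Two toy layers over a 2-dimensional hidden state (made-up numbers).
params = [([1.0, 0.5], [0.1, -0.2]),
          ([0.8, 1.2], [0.0, 0.3])]
x = [0.5, 1.0]

baseline = forward(x, params)
# Steer after layer 0 with a hypothetical vector; alpha scales the intervention.
steered = forward(x, params, steer_at=0, vector=[0.2, -0.1], alpha=1.0)
```

The point of the sketch is that the base weights are untouched: the intervention is a runtime edit to one layer's activations, which is what lets this style of method target specific internal representations without retraining.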