Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

arXiv cs.AI / 4/1/2026


Key Points

  • The paper examines whether safety fine-tuning that reduces harmful mind-attribution in LLMs also impairs related socio-cognitive abilities like Theory of Mind (ToM).
  • Using safety ablation and mechanistic representational-similarity analyses (see the sketches below), the authors find that self-directed and artifact-directed mind-attributions are dissociable from ToM capabilities, both behaviorally and mechanistically.
  • The results suggest that safety fine-tuned models do not necessarily lose ToM competence, even as they change how they attribute mental states.
  • However, the study also finds that safety fine-tuning biases models toward under-attributing minds to non-human animals relative to human baselines and reduces their tendency to express “spiritual belief.”
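
The paper's exact safety-ablation procedure is not described here. As an assumption, the sketch below shows one common form of ablation from the mechanistic-interpretability literature: projecting a candidate "safety" direction out of a model's hidden states. The names, shapes, and the direction itself are hypothetical placeholders, not the authors' setup.

```python
# Minimal sketch of directional ablation (an assumption about what "safety
# ablation" could look like, not the paper's procedure). A candidate direction,
# e.g. estimated from contrastive prompt pairs, is projected out of the
# activations at some layer.
import numpy as np

def ablate_direction(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden-state vector along `direction`.

    hidden:    (n_tokens, d_model) activations at a chosen layer
    direction: (d_model,) candidate direction to ablate
    """
    u = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ u, u)

# Hypothetical usage with stand-in random activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(10, 768))    # 10 token positions, d_model = 768
safety_dir = rng.normal(size=768)    # placeholder "safety" direction
ablated = ablate_direction(acts, safety_dir)
print(np.allclose(ablated @ (safety_dir / np.linalg.norm(safety_dir)), 0.0))  # True
```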

Abstract

Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.
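
The abstract names representational-similarity analysis as one of its mechanistic tools. As an illustration of that general technique only (not the paper's implementation), the sketch below builds representational dissimilarity matrices (RDMs) over a shared prompt set and correlates them; the activations, prompt counts, and layer choice are hypothetical.

```python
# Generic representational similarity analysis (RSA) sketch: compare how two
# sets of activations (e.g., base vs. safety fine-tuned model, same prompts)
# structure the same stimuli. Illustrative only; not the paper's pipeline.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(activations: np.ndarray) -> np.ndarray:
    """Condensed representational dissimilarity matrix: pairwise correlation
    distances between rows (one row = activations for one prompt)."""
    return pdist(activations, metric="correlation")

def rsa_similarity(acts_a: np.ndarray, acts_b: np.ndarray) -> float:
    """Spearman correlation between two RDMs computed over the same prompts."""
    rho, _ = spearmanr(rdm(acts_a), rdm(acts_b))
    return float(rho)

# Hypothetical usage: hidden states for 50 shared prompts, d_model = 768.
rng = np.random.default_rng(0)
base_acts = rng.normal(size=(50, 768))
tuned_acts = base_acts + 0.1 * rng.normal(size=(50, 768))
print(f"RDM Spearman correlation: {rsa_similarity(base_acts, tuned_acts):.3f}")
```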