Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm

arXiv cs.AI / 3/20/2026

Key Points

  • The paper evaluates whether current Large Language Models (LLMs) exhibit Theory of Mind (ToM), using an adapted Strange Stories paradigm to probe the beliefs, intentions, and emotions of story characters.
  • The study tested five LLMs and compared them to human controls, finding a performance gap for earlier and smaller models while GPT-4o showed high accuracy and robustness comparable to humans in challenging conditions.
  • GPT-4o's performance suggests some capacity for mental-state attribution in advanced LLMs, but results do not settle whether this reflects genuine understanding or pattern completion.
  • The authors situate the results within ongoing debates about the cognitive status of LLMs and the boundary between genuine understanding and statistical approximation in language models.

Abstract

The study explores whether current Large Language Models (LLMs) exhibit Theory of Mind (ToM) capabilities -- specifically, the ability to infer others' beliefs, intentions, and emotions from text. Given that LLMs are trained on language data without social embodiment or access to other manifestations of mental representations, their apparent social-cognitive reasoning raises key questions about the nature of their understanding. Are they capable of robust mental-state attribution indistinguishable from human ability in its output, or do their outputs merely reflect superficial pattern completion? To address this question, we tested five LLMs and compared their performance to that of human controls using an adapted version of a text-based tool widely used in human ToM research. The test involves answering questions about the beliefs, intentions, and emotions of story characters. The results revealed a performance gap between the models. Earlier and smaller models were strongly affected by the number of relevant inferential cues available and, to some extent, were also vulnerable to the presence of irrelevant or distracting information in the texts. In contrast, GPT-4o demonstrated high accuracy and strong robustness, performing comparably to humans even in the most challenging conditions. This work contributes to ongoing debates about the cognitive status of LLMs and the boundary between genuine understanding and statistical approximation.
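As a rough illustration of the kind of text-based protocol the abstract describes: each item pairs a short story with a question about a character's mental state, and responses are scored for accuracy so that model and human performance can be compared. The sketch below is a minimal, invented harness; the story, question, and keyword-based rubric are illustrative assumptions, not the paper's actual materials or scoring method.

```python
from dataclasses import dataclass

@dataclass
class StrangeStoriesItem:
    """One Strange Stories-style test item (illustrative, not from the paper)."""
    story: str
    question: str
    keywords: list[str]  # mental-state content an acceptable answer should mention

def score_response(item: StrangeStoriesItem, response: str) -> int:
    """Crude 0-2 rubric: 2 if the answer covers the expected mental-state
    content, 1 if it does so partially, 0 otherwise."""
    hits = sum(1 for kw in item.keywords if kw in response.lower())
    if hits >= 2:
        return 2
    return 1 if hits == 1 else 0

# Hypothetical item: a second-order deception scenario.
item = StrangeStoriesItem(
    story="Anna breaks a vase and tells her mother the cat knocked it over.",
    question="Why does Anna say this?",
    keywords=["lie", "trouble", "blame"],
)

print(score_response(item, "She lies to avoid getting into trouble."))  # → 2
print(score_response(item, "The cat did it."))                          # → 0
```

In the study itself, answers from each of the five LLMs and from human controls would be scored this way (typically by human raters rather than keyword matching), and mean accuracy compared across conditions of varying difficulty.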