AI Navigate

CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?

arXiv cs.CL / 3/13/2026


Key Points

  • CoMMET is a new multimodal benchmark dataset designed to evaluate Theory of Mind in LLMs, extending assessment beyond text inputs.
  • It introduces multi-turn testing and is inspired by the Theory of Mind Booklet Task; the authors report it is the first multimodal ToM benchmark evaluated in a multi-turn conversational setting.
  • The study evaluates multiple LLM families and sizes to analyze strengths and limitations and to identify directions for future improvement.
  • By probing social cognitive abilities, CoMMET aims to enable more natural and effective human-AI interactions.
  • This release provides a new resource for the AI research community to benchmark ToM-related performance across modalities and conversational turns.

Abstract

Theory of Mind (ToM), the ability to reason about the mental states of oneself and others, is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.
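The multi-turn evaluation setup the abstract describes can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the `ToMItem` fields, the model-call signature, and exact-match scoring are hypothetical and are not taken from the CoMMET paper, which does not specify its data format here.

```python
from dataclasses import dataclass

@dataclass
class ToMItem:
    """Hypothetical multi-turn ToM item: one visual stimulus plus an
    ordered sequence of (question, expected_answer) turns."""
    image_id: str                       # reference to the multimodal (visual) input
    turns: list                         # [(question, expected_answer), ...]

def evaluate(model, items):
    """Score a model on multi-turn items; each turn sees the full dialogue
    history, so later answers can depend on earlier ones."""
    correct = total = 0
    for item in items:
        history = []
        for question, expected in item.turns:
            history.append({"role": "user", "content": question})
            # Hypothetical interface: the model consumes the image reference
            # and the running conversation, and returns a string answer.
            answer = model(item.image_id, history)
            history.append({"role": "assistant", "content": answer})
            correct += int(answer.strip().lower() == expected.strip().lower())
            total += 1
    return correct / total if total else 0.0
```

A stub model that answers each question correctly would score 1.0, while a model that loses track of earlier turns would be penalized on the later, history-dependent questions, which is the property multi-turn testing is meant to probe.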