DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

arXiv cs.CL · April 23, 2026

Key Points

  • The paper introduces DialToM, a human-verified benchmark for evaluating how well LLMs perform Theory of Mind (ToM) in state-driven dialogue forecasting.
  • It assesses both Literal ToM (predicting mental states) and Functional ToM (using those states to select state-consistent dialogue trajectories) via Prospective Diagnostic Forecasting.
  • Results show an asymmetry in reasoning: models can identify mental states well, but most fail to use that understanding to forecast social dialogue trajectories, with Gemini 3 Pro as a notable exception.
  • The study finds only weak semantic alignment between human and LLM-generated inferences.
  • The authors release the DialToM dataset and evaluation code publicly to support reproducibility.

Abstract

Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting -- probing whether models can identify state-consistent dialogue trajectories solely from mental-state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM-generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth-py/DialToM.
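To make the evaluation setup concrete, the reported "reasoning asymmetry" amounts to comparing multiple-choice accuracy on the two task types. The sketch below is a minimal, hypothetical illustration of that comparison; the record fields (`task`, `gold`, `prediction`) and the task labels are assumptions for illustration, not the actual schema of the released DialToM dataset.

```python
# Hypothetical sketch: per-task multiple-choice accuracy, contrasting
# Literal ToM (mental-state identification) with Functional ToM
# (state-consistent trajectory forecasting). Field names are assumed.
from collections import defaultdict

def score_by_task(records):
    """Return accuracy per task type from prediction records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if r["prediction"] == r["gold"]:
            correct[r["task"]] += 1
    return {t: correct[t] / total[t] for t in total}

# Toy records mimicking the asymmetry the paper reports: strong on
# literal mental-state items, weaker on functional forecasting items.
records = [
    {"task": "literal", "gold": "B", "prediction": "B"},
    {"task": "literal", "gold": "A", "prediction": "A"},
    {"task": "functional", "gold": "C", "prediction": "A"},
    {"task": "functional", "gold": "B", "prediction": "B"},
]
print(score_by_task(records))  # {'literal': 1.0, 'functional': 0.5}
```

A gap between the two accuracies on real model outputs would correspond to the asymmetry the authors describe: identifying mental states without being able to use them prospectively.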