DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

arXiv cs.CL · April 23, 2026

Key Points

  • The paper introduces DialToM, a human-verified benchmark for evaluating how well LLMs perform Theory of Mind (ToM) in state-driven dialogue forecasting.
  • It assesses both Literal ToM (predicting mental states) and Functional ToM (using those states to select state-consistent dialogue trajectories) via Prospective Diagnostic Forecasting.
  • Results show an asymmetry in reasoning: models can identify mental states well, but most fail to use that understanding to forecast social dialogue trajectories, with Gemini 3 Pro as a notable exception.
  • The study finds only weak semantic alignment between human and LLM-generated inferences.
  • The authors release the DialToM dataset and evaluation code publicly to support reproducibility.

Abstract

Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting -- probing whether models can identify state-consistent dialogue trajectories solely from mental-state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM-generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth-py/DialToM.
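To make the evaluation setup concrete, the reported "reasoning asymmetry" amounts to comparing multiple-choice accuracy on the two task types. The sketch below is a minimal, hypothetical illustration of that comparison; the record fields (`task`, `gold`, `prediction`) and the task labels are assumptions for illustration, not the actual schema of the released DialToM dataset.

```python
# Hypothetical sketch: per-task multiple-choice accuracy, contrasting
# Literal ToM (mental-state identification) with Functional ToM
# (state-consistent trajectory forecasting). Field names are assumed.
from collections import defaultdict

def score_by_task(records):
    """Return accuracy per task type from prediction records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if r["prediction"] == r["gold"]:
            correct[r["task"]] += 1
    return {t: correct[t] / total[t] for t in total}

# Toy records mimicking the asymmetry the paper reports: strong on
# literal mental-state items, weaker on functional forecasting items.
records = [
    {"task": "literal", "gold": "B", "prediction": "B"},
    {"task": "literal", "gold": "A", "prediction": "A"},
    {"task": "functional", "gold": "C", "prediction": "A"},
    {"task": "functional", "gold": "B", "prediction": "B"},
]
print(score_by_task(records))  # {'literal': 1.0, 'functional': 0.5}
```

A gap between the two accuracies on real model outputs would correspond to the asymmetry the authors describe: identifying mental states without being able to use them prospectively.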