MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games

arXiv cs.AI / 4/15/2026


Key Points

  • MISID is introduced as a new multimodal, multi-turn, multi-participant benchmark dataset aimed at recognizing complex human intent in strategic deception games, addressing limitations of prior single-utterance or simple-dialogue datasets.
  • The dataset includes fine-grained, two-tier multi-dimensional annotations designed for long-context discourse analysis and evidence-based causal tracking across extended interactions.
  • An evaluation of state-of-the-art Multimodal Large Language Models on MISID finds key weaknesses in complex scenarios, including text-prior visual hallucinations, weak cross-modal synergy, and limited ability to chain causal cues.
  • To mitigate these issues, the authors propose FRACTAM, a baseline framework using a “Decouple-Anchor-Reason” approach to reduce text bias, perform two-stage retrieval for long-range factual anchoring, and build explicit cross-modal evidence chains.
  • Experiments report that FRACTAM improves performance on complex strategic tasks, strengthening hidden-intent detection and inference while preserving robust perceptual accuracy; the dataset is publicly available online.
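The three-stage "Decouple-Anchor-Reason" paradigm described above can be illustrated with a toy sketch. Everything here (the `Turn` type, the overlap scorer, the function names) is hypothetical scaffolding for exposition, not the authors' implementation: modality-specific facts are kept separate (Decouple), a coarse-then-fine retrieval pass anchors the query to relevant turns in a long history (Anchor), and the retrieved facts are chained into explicit cross-modal evidence (Reason).

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One dialogue turn, with facts extracted per modality (hypothetical schema)."""
    speaker: str
    text_facts: list     # facts stated verbally in this turn
    visual_facts: list   # facts observed visually, extracted independently of the text


def decouple(turn):
    """Decouple: keep each modality's facts separate, so text priors
    cannot overwrite what was actually observed visually."""
    return {"text": list(turn.text_facts), "visual": list(turn.visual_facts)}


def anchor(history, query, coarse_k=4, fine_k=2):
    """Anchor: two-stage retrieval over a long dialogue history.
    A real system would use embeddings; plain word overlap stands in here."""
    def overlap(turn):
        q = set(query.lower().split())
        facts = " ".join(turn.text_facts + turn.visual_facts).lower().split()
        return len(q & set(facts))
    coarse = sorted(history, key=overlap, reverse=True)[:coarse_k]  # coarse pass
    return sorted(coarse, key=overlap, reverse=True)[:fine_k]      # fine re-rank


def reason(anchored_turns):
    """Reason: chain the retrieved facts into an explicit, verifiable
    cross-modal evidence list (visual first, to counter text bias)."""
    chain = []
    for turn in anchored_turns:
        facts = decouple(turn)
        for modality in ("visual", "text"):
            for fact in facts[modality]:
                chain.append((turn.speaker, modality, fact))
    return chain


history = [
    Turn("A", ["claims innocent"], ["avoids eye contact"]),
    Turn("B", ["accuses A"], []),
]
evidence = reason(anchor(history, "is A innocent"))
# evidence pairs each fact with its speaker and source modality,
# e.g. ("A", "visual", "avoids eye contact")
```

The point of the sketch is the ordering of concerns: unimodal extraction happens before retrieval, and reasoning consumes only explicitly anchored evidence, so a downstream MLLM can be asked to justify an intent label against the chain rather than against raw text.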

Abstract

Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods. To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition. Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking. Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) on MISID reveals critical deficiencies in complex scenarios, including text-prior visual hallucination, impaired cross-modal synergy, and limited capacity in chaining causal cues. Consequently, we propose FRACTAM as a baseline framework. Using a “Decouple-Anchor-Reason” paradigm, FRACTAM reduces text bias by extracting pure unimodal factual representations, employs two-stage retrieval for long-range factual anchoring, and constructs explicit cross-modal evidence chains. Extensive experiments demonstrate that FRACTAM enhances mainstream models' performance in complex strategic tasks, improving hidden intent detection and inference while maintaining robust perceptual accuracy. Our dataset is available at https://naislab.cn/datasets/MISID.