MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games

arXiv cs.AI / 4/15/2026


Key Points

  • MISID is introduced as a new multimodal, multi-turn, multi-participant benchmark dataset aimed at recognizing complex human intent in strategic deception games, addressing limitations of prior single-utterance or simple-dialogue datasets.
  • The dataset includes fine-grained, two-tier multi-dimensional annotations designed for long-context discourse analysis and evidence-based causal tracking across extended interactions.
  • An evaluation of state-of-the-art Multimodal Large Language Models on MISID finds key weaknesses in complex scenarios, including text-prior visual hallucinations, weak cross-modal synergy, and limited ability to chain causal cues.
  • To mitigate these issues, the authors propose FRACTAM, a baseline framework using a “Decouple-Anchor-Reason” approach to reduce text bias, perform two-stage retrieval for long-range factual anchoring, and build explicit cross-modal evidence chains.
  • Experiments report that FRACTAM improves performance on complex strategic tasks, strengthening hidden-intent detection and inference while preserving robust perceptual accuracy; the dataset is publicly available online.
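The three-stage "Decouple-Anchor-Reason" paradigm described above can be illustrated with a toy sketch. Everything here (the `Turn` type, the overlap scorer, the function names) is hypothetical scaffolding for exposition, not the authors' implementation: modality-specific facts are kept separate (Decouple), a coarse-then-fine retrieval pass anchors the query to relevant turns in a long history (Anchor), and the retrieved facts are chained into explicit cross-modal evidence (Reason).

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One dialogue turn, with facts extracted per modality (hypothetical schema)."""
    speaker: str
    text_facts: list     # facts stated verbally in this turn
    visual_facts: list   # facts observed visually, extracted independently of the text


def decouple(turn):
    """Decouple: keep each modality's facts separate, so text priors
    cannot overwrite what was actually observed visually."""
    return {"text": list(turn.text_facts), "visual": list(turn.visual_facts)}


def anchor(history, query, coarse_k=4, fine_k=2):
    """Anchor: two-stage retrieval over a long dialogue history.
    A real system would use embeddings; plain word overlap stands in here."""
    def overlap(turn):
        q = set(query.lower().split())
        facts = " ".join(turn.text_facts + turn.visual_facts).lower().split()
        return len(q & set(facts))
    coarse = sorted(history, key=overlap, reverse=True)[:coarse_k]  # coarse pass
    return sorted(coarse, key=overlap, reverse=True)[:fine_k]      # fine re-rank


def reason(anchored_turns):
    """Reason: chain the retrieved facts into an explicit, verifiable
    cross-modal evidence list (visual first, to counter text bias)."""
    chain = []
    for turn in anchored_turns:
        facts = decouple(turn)
        for modality in ("visual", "text"):
            for fact in facts[modality]:
                chain.append((turn.speaker, modality, fact))
    return chain


history = [
    Turn("A", ["claims innocent"], ["avoids eye contact"]),
    Turn("B", ["accuses A"], []),
]
evidence = reason(anchor(history, "is A innocent"))
# evidence pairs each fact with its speaker and source modality,
# e.g. ("A", "visual", "avoids eye contact")
```

The point of the sketch is the ordering of concerns: unimodal extraction happens before retrieval, and reasoning consumes only explicitly anchored evidence, so a downstream MLLM can be asked to justify an intent label against the chain rather than against raw text.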

Abstract

Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods. To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition. Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking. Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) on MISID reveals critical deficiencies in complex scenarios, including text-prior visual hallucination, impaired cross-modal synergy, and limited capacity in chaining causal cues. Consequently, we propose FRACTAM as a baseline framework. Using a “Decouple-Anchor-Reason” paradigm, FRACTAM reduces text bias by extracting pure unimodal factual representations, employs two-stage retrieval for long-range factual anchoring, and constructs explicit cross-modal evidence chains. Extensive experiments demonstrate that FRACTAM enhances mainstream models' performance in complex strategic tasks, improving hidden intent detection and inference while maintaining robust perceptual accuracy. Our dataset is available at https://naislab.cn/datasets/MISID.