StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

arXiv cs.AI / 4/28/2026


Key Points

  • The paper argues that existing video moment retrieval models struggle with narrative content because they can identify “what is happening” but not infer “why it matters,” due to a missing Theory of Mind (ToM) component.
  • It introduces StoryTR, a new benchmark for narrative short-form video retrieval that explicitly requires ToM-style reasoning, with 8.1k samples designed to test subtle multimodal cues and implied mental states.
  • The authors propose an Agentic Data Pipeline that generates training data with structured three-tier ToM reasoning chains, covering intent decoding, narrative reasoning, and boundary localization.
  • Experiments show a large reasoning gap: Gemini-3.0-Pro reaches only 0.53 Avg IoU on StoryTR, while the 7B Shorts-Moment model trained with ToM-guided data improves IoU by 15.1% relative to baselines, suggesting reasoning quality can outweigh sheer parameter count.
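For context, the "Avg IoU" figures cited above are, in moment retrieval, typically the temporal intersection-over-union between a predicted segment and the ground-truth segment, averaged over all queries. The exact evaluation protocol StoryTR uses is not spelled out here, so the following is a generic sketch of that standard computation, not the paper's own code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two moments given as (start_s, end_s) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def avg_iou(preds, gts):
    """Mean per-query temporal IoU over a benchmark (the usual 'Avg IoU')."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

# Example: ground truth 12.0-18.0 s, prediction 14.0-20.0 s -> IoU = 4 / 8 = 0.5
print(temporal_iou((14.0, 20.0), (12.0, 18.0)))
```

Under this metric, a 15.1% relative improvement means the trained 7B model's mean IoU is 1.151 times that of its baseline, independent of the absolute score.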

Abstract

Current video moment retrieval excels at action-centric tasks but struggles with narrative content. Models can see "what is happening" but fail to reason "why it matters." This semantic gap stems from the lack of Theory of Mind (ToM): the cognitive ability to infer implicit intentions, mental states, and narrative causality from surface-level observations. We introduce StoryTR, the first video moment retrieval benchmark requiring ToM reasoning, comprising 8.1k samples from narrative short-form videos (shorts/reels). These videos present an ideal testbed: their high information density encodes meaning through subtle multimodal cues. For instance, a glance paired with a sigh carries entirely different semantics than the glance alone. Yet multimodal perception alone is insufficient; ToM is required to decode that a character "smiling" may actually be "concealing hostility." To teach models this reasoning capability, we propose an Agentic Data Pipeline that generates training data with explicit three-tier ToM chains (intent decoding, narrative reasoning, boundary localization). Experiments reveal the severity of the reasoning gap: Gemini-3.0-Pro achieves only 0.53 Avg IoU on StoryTR. However, our 7B Shorts-Moment model, trained on ToM-guided data, improves relative IoU by 15.1% over baselines, demonstrating that narrative reasoning capability matters more than parameter scale.
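To make the "explicit three-tier ToM chain" more concrete, here is a minimal, hypothetical schema for one training sample produced by such a pipeline. The field names and example content are illustrative assumptions based on the abstract's description (intent decoding, narrative reasoning, boundary localization), not the paper's actual data format.

```python
from dataclasses import dataclass

@dataclass
class ToMReasoningChain:
    """Hypothetical schema for one ToM-guided training sample.

    Field names are illustrative, not the pipeline's actual output format.
    """
    query: str                    # narrative query about the video
    intent_decoding: str          # tier 1: mental states inferred from surface cues
    narrative_reasoning: str      # tier 2: how those intentions drive the story's causality
    boundary_localization: tuple  # tier 3: (start_s, end_s) of the grounded moment

sample = ToMReasoningChain(
    query="Find the moment the host realizes the gift was a setup.",
    intent_decoding="The host's smile tightens while the guest glances away: the smile masks suspicion.",
    narrative_reasoning="That concealed suspicion is the turning point that explains the host's later refusal.",
    boundary_localization=(23.5, 31.0),
)
```

Structuring supervision this way would force a model to commit to an interpretation of the characters' mental states before predicting temporal boundaries, which is the capability the benchmark is designed to probe.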
