StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning
arXiv cs.AI / 4/28/2026
📰 News · Models & Research
Key Points
- The paper argues that existing video moment retrieval models struggle with narrative content because they can identify “what is happening” but not infer “why it matters,” due to a missing Theory of Mind (ToM) component.
- It introduces StoryTR, a new benchmark for narrative short-form video retrieval that explicitly requires ToM-style reasoning, with 8.1k samples designed to test subtle multimodal cues and implied mental states.
- The authors propose an Agentic Data Pipeline that generates training data with structured three-tier ToM reasoning chains, covering intent decoding, narrative reasoning, and boundary localization.
- Experiments show a large reasoning gap: Gemini-3.0-Pro reaches only 0.53 Avg IoU on StoryTR, while the 7B Shorts-Moment model trained with ToM-guided data improves IoU by 15.1% relative to baselines, suggesting reasoning quality can outweigh sheer parameter count.
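The Avg IoU figures above measure how closely a predicted time span overlaps the ground-truth moment. As a minimal sketch (not the paper's evaluation code), temporal IoU over (start, end) intervals can be computed like this:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals (start, end) in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return intersection / union if union > 0 else 0.0


def avg_iou(preds: list[tuple[float, float]], gts: list[tuple[float, float]]) -> float:
    """Average temporal IoU over a set of retrieval predictions."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# Example: predicted moment 2s-8s vs. ground truth 4s-10s
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # → 0.5
```

An Avg IoU of 0.53 therefore means that, on average, predicted moments cover only about half of the union of the predicted and true spans.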