VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

arXiv cs.CV / 4/8/2026


Key Points

  • VideoStir tackles the challenge of applying multimodal LLMs to long videos, where limited context windows make it hard to use relevant visual evidence end-to-end.
  • The method represents a video as a spatio-temporal graph over clips and uses multi-hop retrieval to gather evidence across temporally distant but contextually related events.
  • VideoStir adds an MLLM-based intent-relevance scorer that retrieves frames by alignment with the query’s reasoning intent, aiming to reduce reliance on brittle explicit semantic matching.
  • To train the intent alignment component, the authors introduce the IR-600K dataset for learning frame-to-query intent relevance.
  • Experiments reportedly show competitive performance versus state-of-the-art baselines without auxiliary information, and the paper provides code/checkpoints.
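The graph-plus-multi-hop idea in the points above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the clip representation, the edge criteria (temporal adjacency and a semantic-similarity threshold), and the hop budget are all assumed placeholders.

```python
from collections import deque

def build_clip_graph(clips, sim, temporal_window=1, sim_threshold=0.8):
    """Link clips that are temporally adjacent or semantically similar.
    `sim` is a hypothetical clip-similarity function (an assumption)."""
    graph = {i: set() for i in range(len(clips))}
    for i in range(len(clips)):
        for j in range(i + 1, len(clips)):
            if j - i <= temporal_window or sim(clips[i], clips[j]) >= sim_threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph

def multi_hop_retrieve(graph, seed_ids, max_hops=2):
    """BFS outward from query-matched seed clips, gathering evidence from
    temporally distant but graph-connected (contextually related) clips."""
    visited = set(seed_ids)
    frontier = deque((s, 0) for s in seed_ids)
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                frontier.append((nb, hops + 1))
    return sorted(visited)
```

The key design point this sketch captures is that retrieval is not a flat top-k over independent segments: a seed clip can pull in a far-away clip through a semantic edge, even when that clip would score poorly against the query on its own.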

Abstract

Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. While retrieval-augmented generation (RAG) offers a promising remedy by organizing query-relevant visual evidence into a compact context, most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query's intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It first structures a video as a spatio-temporal graph at the clip level, then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query's reasoning intent. To support this capability, we curate IR-600K, a large-scale dataset tailored for learning frame-query intent alignment. Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, highlighting the promise of shifting long-video RAG from flattened semantic matching to structured, intent-aware reasoning. Code and checkpoints are available on GitHub.
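The intent-aware frame-selection step of the abstract can be sketched in a few lines. Everything here is hypothetical: `score_fn` stands in for the paper's MLLM-backed intent-relevance scorer (which the paper trains on IR-600K), and the function name and frame budget are illustrative assumptions, not the authors' API.

```python
def select_frames(frames, query, score_fn, budget=8):
    """Rank frames by a learned intent-relevance score and keep the top
    `budget` for the compact RAG context. `score_fn(frame, query)` is a
    placeholder for an MLLM-based scorer; higher means more intent-aligned."""
    ranked = sorted(frames, key=lambda f: score_fn(f, query), reverse=True)
    return ranked[:budget]
```

The contrast with explicit semantic matching is in what `score_fn` measures: rather than surface similarity between a frame and the query text, it is trained to score how well a frame supports the query's reasoning intent, so implicitly relevant frames can still rank highly.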