Narrative Aligned Long Form Video Question Answering

arXiv cs.CV / March 23, 2026


Key Points

  • NA-VQA introduces a benchmark to evaluate deep temporal and narrative reasoning in long-form videos, addressing limitations of prior benchmarks that rely on localized cues.
  • The dataset contains 88 full-length movies and 4.4K open-ended QA pairs with evidence spans labeled Short, Medium, or Far to assess long-range dependencies.
  • Video-NaRA is proposed as a narrative-centric framework that constructs event-level chains stored in structured memory to support reasoning across scenes.
  • Experiments show state-of-the-art multimodal LLMs struggle with far-range questions, underscoring the need for explicit narrative modeling.
  • The authors report up to a 3 percent improvement in long-range reasoning with Video-NaRA and plan to release NA-VQA upon publication.

Abstract

Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture narrative reasoning, the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. We introduce NA-VQA, a benchmark designed to evaluate deep temporal and narrative reasoning in long-form videos. NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in multiple evidence spans labeled as Short, Medium, or Far to assess long-range dependencies. By requiring generative, multi-scene answers, NA-VQA tests whether models can integrate dispersed narrative information rather than rely on shallow pattern matching. To address the limitations of existing approaches, we propose Video-NaRA, a narrative-centric framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning. Extensive experiments show that state-of-the-art MLLMs perform poorly on questions requiring far-range evidence, highlighting the need for explicit narrative modeling. Video-NaRA improves long-range reasoning performance by up to 3 percent, demonstrating its effectiveness in handling complex narrative structures. We will release NA-VQA upon publication.
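To make the idea of event-level chains stored in a structured memory more concrete, here is a minimal, hypothetical sketch. The class names, fields, and entity-based retrieval strategy are illustrative assumptions for exposition only, not the authors' Video-NaRA implementation:

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One narrative event extracted from a scene (hypothetical schema)."""
    scene_id: int
    description: str
    entities: set

class NarrativeMemory:
    """Toy structured memory: events linked into chains by shared entities,
    so a question can retrieve evidence dispersed across distant scenes."""

    def __init__(self):
        self.events = []

    def add_event(self, event: Event) -> None:
        self.events.append(event)

    def chain_for(self, entity: str) -> list:
        # Retrieve every event mentioning the entity, in scene order,
        # reconstructing a causal chain that spans the whole movie.
        return sorted(
            (e for e in self.events if entity in e.entities),
            key=lambda e: e.scene_id,
        )

mem = NarrativeMemory()
mem.add_event(Event(3, "Ada hides the letter", {"Ada", "letter"}))
mem.add_event(Event(12, "Ben argues with Ada", {"Ada", "Ben"}))
mem.add_event(Event(87, "Ben discovers the letter", {"Ben", "letter"}))

chain = mem.chain_for("letter")
print([e.scene_id for e in chain])  # far-range evidence links scenes 3 and 87
```

A real system would extract events with an MLLM and retrieve by semantic similarity rather than exact entity match, but the structure above captures the core idea: evidence for a "Far"-labeled question lives in events whose scene indices are widely separated, and the memory makes that chain explicit instead of relying on localized cues.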