Narrative Aligned Long Form Video Question Answering

arXiv cs.CV / March 23, 2026


Key Points

  • NA-VQA introduces a benchmark to evaluate deep temporal and narrative reasoning in long-form videos, addressing limitations of prior benchmarks that rely on localized cues.
  • The dataset contains 88 full-length movies and 4.4K open-ended QA pairs with evidence spans labeled Short, Medium, or Far to assess long-range dependencies.
  • Video-NaRA is proposed as a narrative-centric framework that constructs event-level chains stored in structured memory to support reasoning across scenes.
  • Experiments show state-of-the-art multimodal LLMs struggle with far-range questions, underscoring the need for explicit narrative modeling.
  • The authors report up to a 3 percent improvement in long-range reasoning with Video-NaRA and plan to release NA-VQA upon publication.

Abstract

Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture narrative reasoning, the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. We introduce NA-VQA, a benchmark designed to evaluate deep temporal and narrative reasoning in long-form videos. NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in multiple evidence spans labeled as Short, Medium, or Far to assess long-range dependencies. By requiring generative, multi-scene answers, NA-VQA tests whether models can integrate dispersed narrative information rather than rely on shallow pattern matching. To address the limitations of existing approaches, we propose Video-NaRA, a narrative-centric framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning. Extensive experiments show that state-of-the-art MLLMs perform poorly on questions requiring far-range evidence, highlighting the need for explicit narrative modeling. Video-NaRA improves long-range reasoning performance by up to 3 percent, demonstrating its effectiveness in handling complex narrative structures. We will release NA-VQA upon publication.
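To make the idea of event-level chains stored in a structured memory more concrete, here is a minimal, hypothetical sketch. The class names, fields, and entity-based retrieval strategy are illustrative assumptions for exposition only, not the authors' Video-NaRA implementation:

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One narrative event extracted from a scene (hypothetical schema)."""
    scene_id: int
    description: str
    entities: set

class NarrativeMemory:
    """Toy structured memory: events linked into chains by shared entities,
    so a question can retrieve evidence dispersed across distant scenes."""

    def __init__(self):
        self.events = []

    def add_event(self, event: Event) -> None:
        self.events.append(event)

    def chain_for(self, entity: str) -> list:
        # Retrieve every event mentioning the entity, in scene order,
        # reconstructing a causal chain that spans the whole movie.
        return sorted(
            (e for e in self.events if entity in e.entities),
            key=lambda e: e.scene_id,
        )

mem = NarrativeMemory()
mem.add_event(Event(3, "Ada hides the letter", {"Ada", "letter"}))
mem.add_event(Event(12, "Ben argues with Ada", {"Ada", "Ben"}))
mem.add_event(Event(87, "Ben discovers the letter", {"Ben", "letter"}))

chain = mem.chain_for("letter")
print([e.scene_id for e in chain])  # far-range evidence links scenes 3 and 87
```

A real system would extract events with an MLLM and retrieve by semantic similarity rather than exact entity match, but the structure above captures the core idea: evidence for a "Far"-labeled question lives in events whose scene indices are widely separated, and the memory makes that chain explicit instead of relying on localized cues.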