SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

arXiv cs.CV / 4/8/2026


Key Points

  • The paper proposes SVAgent, a storyline-guided cross-modal multi-agent framework designed to improve video question answering by reasoning over coherent narrative progressions rather than only selecting relevant frames.
  • SVAgent uses a storyline agent that incrementally builds a narrative representation from frames suggested by a refinement suggestion agent, which analyzes historical failure cases to refine its suggestions.
  • Separate cross-modal decision agents independently predict answers using visual and textual modalities, with their outputs constrained and improved by the evolving storyline representation.
  • A meta-agent evaluates and aligns cross-modal predictions to boost reasoning robustness and improve answer consistency, aiming for more human-like interpretability.
  • Experiments report that SVAgent outperforms existing approaches for VideoQA while providing greater interpretability through its storyline-based reasoning process.
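To make the agent roles concrete, here is a minimal toy sketch of the collaboration loop the key points describe: a suggestion agent proposes frames (avoiding past failures), a storyline agent accumulates a narrative, two decision agents answer from that narrative, and a meta-agent accepts only aligned predictions. All class names, method signatures, and the trivial matching logic are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the SVAgent collaboration loop (not the paper's API).

class RefinementSuggestionAgent:
    """Suggests the next frame index, skipping frames tied to past failures."""
    def __init__(self):
        self.failed_frames = set()  # indices recorded from historical failures

    def suggest(self, num_frames, used):
        for i in range(num_frames):
            if i not in used and i not in self.failed_frames:
                return i
        return None

    def record_failure(self, idx):
        # "Learn" from a failure by never re-suggesting that frame.
        self.failed_frames.add(idx)

class StorylineAgent:
    """Incrementally builds a narrative representation from suggested frames."""
    def __init__(self):
        self.storyline = []  # ordered frame captions stand in for the narrative

    def extend(self, frame_caption):
        self.storyline.append(frame_caption)

class DecisionAgent:
    """Predicts an answer from one modality, conditioned on the storyline."""
    def __init__(self, modality):
        self.modality = modality

    def predict(self, storyline, question):
        # Toy rule: return the latest storyline event mentioning a question word.
        for caption in reversed(storyline):
            if any(w in caption for w in question.lower().split()):
                return caption
        return None

class MetaAgent:
    """Aligns cross-modal predictions; here, accepts them only if they agree."""
    def align(self, visual_pred, textual_pred):
        if visual_pred is not None and visual_pred == textual_pred:
            return visual_pred
        return None

def answer_question(frames, question, rounds=3):
    suggester, narrator = RefinementSuggestionAgent(), StorylineAgent()
    visual, textual = DecisionAgent("visual"), DecisionAgent("textual")
    meta, used = MetaAgent(), set()
    for _ in range(rounds):
        idx = suggester.suggest(len(frames), used)
        if idx is None:
            break
        used.add(idx)
        narrator.extend(frames[idx])
        consensus = meta.align(
            visual.predict(narrator.storyline, question),
            textual.predict(narrator.storyline, question),
        )
        if consensus is not None:
            return consensus
        suggester.record_failure(idx)  # no aligned answer yet: log the failure
    return None
```

In this toy version both decision agents share one rule, so consensus is trivial; the paper's contribution lies in how the real agents diverge across modalities and how the meta-agent reconciles them.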

Abstract

Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have introduced various approaches for video understanding, most existing methods still rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline. Humans naturally interpret videos as coherent storylines, an ability that is crucial for making robust and contextually grounded predictions. To address this gap, we propose SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross-modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta-agent to align cross-modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.