Progressive Video Condensation with MLLM Agent for Long-form Video Understanding

arXiv cs.CV / 4/6/2026


Key Points

  • The paper introduces ProVCA, an agent for efficient long-form video understanding that reduces the compute and frame cost of video-based multimodal LLM (MLLM) reasoning.
  • ProVCA works progressively by first localizing the query-relevant video segment, then selecting important snippets via similarity, and finally refining to specific keyframes for targeted MLLM processing.
  • The authors argue that prior text-then-LLM approaches can miss fine-grained visual cues, while direct video-MLLM pipelines are too frame-hungry, motivating the condensation strategy.
  • ProVCA reports state-of-the-art zero-shot accuracies of 69.3% on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA while using fewer frames than existing training-free methods.

Abstract

Understanding long videos requires extracting query-relevant information from long sequences under tight compute budgets. Existing text-then-LLM pipelines lose fine-grained visual cues, while video-based multimodal large language models (MLLMs) can keep visual details but are too frame-hungry and computationally expensive. In this work, we aim to harness MLLMs for efficient video understanding. We propose ProVCA, a progressive video condensation agent that iteratively locates key video frames at multiple granularities. ProVCA first adopts a segment localization module to identify the video segment relevant to the query, then a snippet selection module to select important snippets based on similarity, and finally a keyframe refinement module to pinpoint specific keyframes in those snippets. By progressively narrowing the scope from coarse segments to fine frames, ProVCA identifies a small set of keyframes for MLLM-based reasoning. ProVCA achieves state-of-the-art zero-shot accuracies of 69.3% on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA, while using fewer frames than previous training-free methods.
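The coarse-to-fine pipeline the abstract describes can be sketched as three similarity-driven narrowing passes over frame embeddings. The function names, segment/snippet sizes, and the use of plain cosine similarity below are illustrative assumptions, not the paper's actual modules or scoring functions:

```python
import numpy as np

# Hedged sketch of a ProVCA-style progressive condensation loop.
# All names and hyperparameters here are hypothetical; the paper's
# modules may use learned scorers rather than raw cosine similarity.

def cosine(query, feats):
    """Cosine similarity between a query vector and rows of a feature matrix."""
    q = query / np.linalg.norm(query)
    f = feats / np.linalg.norm(feats, axis=-1, keepdims=True)
    return f @ q

def localize_segment(frame_feats, query_feat, num_segments=4):
    """Stage 1: pick the coarse segment whose frames best match the query."""
    segments = np.array_split(np.arange(len(frame_feats)), num_segments)
    scores = [cosine(query_feat, frame_feats[idx]).mean() for idx in segments]
    return segments[int(np.argmax(scores))]

def select_snippets(frame_idx, frame_feats, query_feat, snippet_len=8, top_k=2):
    """Stage 2: within that segment, keep the top-k snippets by similarity."""
    snippets = [frame_idx[i:i + snippet_len]
                for i in range(0, len(frame_idx), snippet_len)]
    scores = [cosine(query_feat, frame_feats[s]).mean() for s in snippets]
    order = np.argsort(scores)[::-1][:top_k]
    return [snippets[i] for i in order]

def refine_keyframes(snippets, frame_feats, query_feat, per_snippet=2):
    """Stage 3: pinpoint the most query-similar frames inside each snippet."""
    keyframes = []
    for s in snippets:
        scores = cosine(query_feat, frame_feats[s])
        best = np.argsort(scores)[::-1][:per_snippet]
        keyframes.extend(int(s[i]) for i in best)
    return sorted(keyframes)

# Toy run: 64 random frame embeddings and one query embedding.
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(64, 32))
query_feat = rng.normal(size=32)

segment = localize_segment(frame_feats, query_feat)
snippets = select_snippets(segment, frame_feats, query_feat)
keyframes = refine_keyframes(snippets, frame_feats, query_feat)
print(keyframes)  # a small set of frame indices handed to the MLLM
```

The point of the structure is the budget: instead of feeding all 64 frames to the MLLM, only the final handful of refined keyframes is passed on, which is the frame-efficiency claim the results section makes.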