Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

arXiv cs.CV / 4/8/2026


Key Points

  • The paper introduces Video-MME-v2, a new benchmark aimed at measuring real-world robustness and reasoning faithfulness in video understanding as existing benchmarks become saturated.
  • It uses a progressive tri-level hierarchy that escalates difficulty from visual information aggregation to temporal dynamics modeling and then complex multimodal reasoning.
  • Instead of simple per-question accuracy, it proposes a group-based non-linear evaluation to enforce consistency and coherent multi-step reasoning, penalizing fragmented or guess-based answers.
  • The benchmark emphasizes data quality with a controlled human annotation process (12 annotators, 50 independent reviewers, and 3,300 human-hours with up to 5 QA rounds).
  • Experiments reveal a sizable gap between the best current model (Gemini-3-Pro) and human experts, and expose a hierarchical bottleneck: errors in visual information aggregation and temporal modeling propagate upward and limit high-level reasoning, with thinking-based reasoning depending heavily on textual cues such as subtitles.

Abstract

With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
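To make the group-based evaluation idea concrete, here is a minimal sketch of what a non-linear group scoring rule could look like. The paper's exact formula is not reproduced in this summary, so the all-or-nothing group credit and the `reasoning_valid` flag below are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    correct: bool          # answer matches the gold label
    reasoning_valid: bool  # judge deems the reasoning chain sound (assumed signal)

def group_score(group: list[Answer]) -> float:
    """Hypothetical group-based non-linear scoring rule.

    Credit is granted only if every question in the group is answered
    correctly AND each answer is backed by valid reasoning; a single
    fragmented or guess-based answer zeroes out the whole group.
    """
    return 1.0 if all(a.correct and a.reasoning_valid for a in group) else 0.0

def benchmark_score(groups: list[list[Answer]]) -> float:
    """Average group credit across the benchmark, in contrast to
    flat per-question accuracy."""
    return sum(group_score(g) for g in groups) / len(groups)

# Example: per-question accuracy within this group is 2/3, but the group
# earns no credit because one correct answer is a lucky guess without
# valid reasoning.
demo = [[Answer(True, True), Answer(True, False), Answer(False, False)]]
print(benchmark_score(demo))  # 0.0
```

The point of such a non-linear rule is that it cannot be gamed by scattered lucky hits: a model must answer an entire related group consistently and with coherent reasoning to score at all.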