Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
arXiv cs.CV / 4/8/2026
Key Points
- The paper introduces Video-MME-v2, a new benchmark aimed at measuring real-world robustness and reasoning faithfulness in video understanding as existing benchmarks become saturated.
- It uses a progressive tri-level hierarchy that escalates difficulty from visual information aggregation to temporal dynamics modeling and then complex multimodal reasoning.
- Instead of simple per-question accuracy, it proposes a group-based non-linear evaluation to enforce consistency and coherent multi-step reasoning, penalizing fragmented or guess-based answers.
- The benchmark emphasizes data quality with a controlled human annotation process (12 annotators, 50 independent reviewers, and 3,300 human-hours with up to 5 QA rounds).
- Experiments show a sizable performance gap between current best results (e.g., Gemini-3-Pro) and human experts, and identify hierarchical bottlenecks in which errors at the early aggregation and temporal stages limit performance at the later reasoning stage (including the influence of subtitles and other textual cues).
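The group-based evaluation in the third point can be illustrated with a small sketch. The paper's exact scoring formula is not given here, so the code below assumes a simple all-or-nothing rule per question group as one plausible non-linear form; the function names and the `demo` data are hypothetical, chosen only to contrast group scoring with flat per-question accuracy.

```python
from typing import Dict, List

def per_question_accuracy(results: Dict[str, List[bool]]) -> float:
    # Flat accuracy: every question counts equally, so a model can
    # score well on easy questions while failing linked ones.
    flat = [ok for group in results.values() for ok in group]
    return sum(flat) / len(flat)

def group_consistency_score(results: Dict[str, List[bool]]) -> float:
    # Hypothetical non-linear group scoring: a group counts only if
    # *all* of its linked questions are answered correctly, penalizing
    # fragmented or guess-based answers.
    groups = list(results.values())
    return sum(all(g) for g in groups) / len(groups)

# Hypothetical results: each key is one video's question group.
demo = {
    "video_A": [True, True, True],   # coherent multi-step reasoning
    "video_B": [True, False, True],  # one broken step in the chain
    "video_C": [False, True, False],
}
```

Under flat accuracy the model above scores 6/9 ≈ 0.67, but under group scoring only 1/3, since only `video_A` is answered coherently end to end; this is the sense in which a non-linear metric rewards consistency over isolated correct answers.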




