MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding

arXiv cs.CV / 3/25/2026


Key Points

  • The paper introduces MVPBench, a new multi-video perception evaluation benchmark aimed at testing multi-modal video understanding beyond single-video or image-only benchmarks.
  • MVPBench contains 14 subtasks across diverse visual domains, with 5K question-answering tests built from 2.7K video clips sourced from existing datasets plus manually annotated clips.
  • The benchmark focuses on evaluating how well models extract relevant information from video sequences to support decision-making.
  • Extensive evaluations show that current models struggle significantly with multi-video inputs, revealing major gaps in multi-video comprehension.
  • The authors position MVPBench as a driver for future advances in multi-video perception research and evaluation.

Abstract

The rapid progress of Large Language Models (LLMs) has spurred growing interest in Multi-modal LLMs (MLLMs) and motivated the development of benchmarks to evaluate their perceptual and comprehension abilities. Existing benchmarks, however, are limited to static images or single videos, overlooking the complex interactions across multiple videos. To address this gap, we introduce the Multi-Video Perception Evaluation Benchmark (MVPBench), a new benchmark featuring 14 subtasks across diverse visual domains designed to evaluate models on extracting relevant information from video sequences to make informed decisions. MVPBench includes 5K question-answering tests involving 2.7K video clips sourced from existing datasets and manually annotated clips. Extensive evaluations reveal that current models struggle to process multi-video inputs effectively, underscoring substantial limitations in their multi-video comprehension. We anticipate MVPBench will drive advancements in multi-video perception.
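To make the task format concrete, below is a minimal sketch of how a multi-video question-answering item and a simple accuracy-based evaluation loop could be structured. The schema and names (`MultiVideoQAItem`, `predict`, the field names) are hypothetical illustrations, not taken from the paper; MVPBench's actual data layout and scoring protocol may differ.

```python
from dataclasses import dataclass

# Hypothetical item schema: each test pairs a question with several clips
# that must be perceived jointly (the paper does not publish its exact format).
@dataclass
class MultiVideoQAItem:
    video_paths: list[str]   # the multiple clips for this item
    question: str
    options: list[str]       # multiple-choice candidates
    answer_index: int        # index of the correct option
    subtask: str             # one of the benchmark's 14 subtasks

def evaluate(items: list[MultiVideoQAItem], predict) -> dict[str, float]:
    """Compute per-subtask and overall accuracy.

    `predict` is any callable mapping (video_paths, question, options)
    to the index of the chosen option, e.g. a wrapper around an MLLM.
    """
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for item in items:
        pred = predict(item.video_paths, item.question, item.options)
        total[item.subtask] = total.get(item.subtask, 0) + 1
        correct[item.subtask] = correct.get(item.subtask, 0) + int(pred == item.answer_index)
    scores = {task: correct[task] / total[task] for task in total}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores
```

Under this sketch, per-subtask accuracies would make it easy to see which of the 14 subtasks drive the reported weaknesses in multi-video comprehension.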