VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

arXiv cs.CV / 5/6/2026


Key Points

  • VEBench is introduced as a new, comprehensive benchmark to evaluate large multimodal models (LMMs) for real-world video editing, focusing on both editing knowledge understanding and operational multimodal reasoning.
  • The benchmark includes 3.9K high-quality edited videos (over 257 hours) and 3,080 human-verified QA pairs created via a three-round human-AI collaborative annotation pipeline for precise temporal labeling and semantic consistency.
  • It provides two complementary tasks: recognizing which of seven video editing techniques is used from multimodal cues, and simulating real editing operations by selecting and temporally localizing relevant clips from multiple candidates (see the sketch after this list).
  • Experiments on both proprietary and open-source LMMs show a significant performance gap versus human-level editing cognition, underscoring the need to bridge video understanding with creative workflow reasoning.
  • The authors position VEBench as a foundation dataset for building more capable intelligent video editing systems and for driving future research on complex reasoning in multimodal settings.
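
To make the two task formats above concrete, the sketch below shows one plausible way the QA pairs could be represented. The paper's actual schema is not given in this summary, so every class and field name here is a hypothetical illustration.

```python
from dataclasses import dataclass

@dataclass
class TechniqueRecognitionQA:
    """Task 1: identify which editing technique a clip uses (multiple choice)."""
    video_path: str     # path to the edited video
    question: str       # e.g. "Which technique is applied at 00:12-00:15?"
    options: list[str]  # candidate labels drawn from the 7 technique categories
    answer: str         # human-verified ground-truth label

@dataclass
class OperationSimulationQA:
    """Task 2: pick the relevant source clip and localize the span to use."""
    instruction: str                  # natural-language editing goal
    candidate_clips: list[str]        # paths to multiple candidate clips
    target_clip: str                  # ground-truth clip to select
    target_span: tuple[float, float]  # ground-truth (start, end) in seconds
```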

Abstract

Real-world video editing demands not only expert knowledge of cinematic techniques but also multimodal reasoning to select, align, and combine footage into coherent narratives. While recent Large Multimodal Models (LMMs) have shown remarkable progress in general video understanding, their abilities in multi-video reasoning and operational editing workflows remain largely unexplored. We introduce VEBENCH, the first comprehensive benchmark designed to evaluate both editing knowledge understanding and operational reasoning in realistic video editing scenarios. VEBENCH contains 3.9K high-quality edited videos (over 257 hours) and 3,080 human-verified QA pairs, built through a three-round human-AI collaborative annotation pipeline that ensures precise temporal labeling and semantic consistency. It features two complementary QA tasks: 1) Video Editing Technique Recognition, assessing models' ability to identify 7 editing techniques using multimodal cues; and 2) Video Editing Operation Simulation, modeling real-world editing workflows by requiring the selection and temporal localization of relevant clips from multiple candidates. Extensive experiments across proprietary (e.g., Gemini-2.5-Pro) and open-source LMMs reveal a large gap between current model performance and human-level editing cognition. These results highlight the urgent need to bridge video understanding with creative operational reasoning. We envision VEBENCH as a foundation for advancing intelligent video editing systems and driving future research on complex reasoning.
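
The abstract does not say how the temporal-localization half of the operation-simulation task is scored. A standard metric for this kind of grounding task is temporal intersection-over-union (IoU) between the predicted and ground-truth spans; the sketch below assumes spans are (start, end) pairs in seconds and is illustrative, not the paper's stated metric.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two well-formed (start, end) time spans."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))  # overlap length
    union = (pe - ps) + (ge - gs) - inter        # combined coverage
    return inter / union if union > 0 else 0.0

# A prediction of 12-17 s against ground truth 10-15 s overlaps for 3 s
# out of 7 s covered in total, so IoU = 3/7 ≈ 0.43.
print(temporal_iou((12.0, 17.0), (10.0, 15.0)))
```

In practice a prediction would typically count as correct when its IoU with the ground-truth span exceeds a threshold such as 0.5, though the thresholds used in the paper are not stated here.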