How Far Are Video Models from True Multimodal Reasoning?

arXiv cs.CV / 4/22/2026

💬 Opinion · Models & Research

Key Points

  • The paper argues that existing video-model benchmarks do not rigorously test whether models achieve true multimodal reasoning because they use oversimplified tasks and fragmented evaluation signals.
  • It introduces CLVG-Bench, a new evaluation framework with 1,000+ manually annotated examples across 6 categories and 47 subcategories, targeting zero-shot reasoning through context learning in video generation (a hypothetical entry layout is sketched after this list).
  • It also proposes an Adaptive Video Evaluator (AVE) that matches human expert judgment using minimal annotations and provides interpretable textual feedback for varied video reasoning tasks.
  • Experiments show that even state-of-the-art video models (e.g., Seedance 2.0) struggle on logically grounded and interactive generation, with success rates under 25% and near 0% respectively, indicating major bottlenecks in multimodal reasoning and physical grounding.
  • The authors claim the framework and accompanying code offer measurable, actionable diagnostics and a roadmap for building more robust general-purpose video models.
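The summary does not reproduce the released benchmark schema, so the following is only a minimal sketch of what a CLVG-Bench metadata entry of the kind described above might look like; the class name, field names, and example values are assumptions for illustration, not the paper's actual format.

```python
from dataclasses import dataclass, field


@dataclass
class CLVGItem:
    """Hypothetical sketch of a CLVG-Bench metadata entry.

    All field names and values are assumptions, not the released schema.
    """
    item_id: str                 # unique identifier for the annotated example
    category: str                # one of the 6 top-level categories, e.g. "physical simulation"
    subcategory: str             # one of the 47 subcategories
    context: str                 # in-context material the model must learn from
    prompt: str                  # generation instruction given to the video model
    success_criteria: list[str] = field(default_factory=list)  # expert-written pass/fail criteria


# Purely illustrative example entry (all content hypothetical):
item = CLVGItem(
    item_id="phys-0001",
    category="physical simulation",
    subcategory="rigid-body collision",
    context="A reference clip showing two balls colliding elastically.",
    prompt="Generate the continuation consistent with the collision shown.",
    success_criteria=["momentum is visually conserved", "no object interpenetration"],
)
```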

Abstract

Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models' zero-shot reasoning capabilities via Context Learning in Video Generation. CLVG-Bench comprises more than 1,000 high-quality, manually annotated metadata entries across 6 categories and 47 subcategories, covering complex scenarios including physical simulation, logical reasoning, and interactive contexts. To enable rigorous and scalable assessment, we further propose an Adaptive Video Evaluator (AVE) that aligns with human expert perception using minimal annotations, delivering interpretable textual feedback across diverse video context tasks. Extensive experiments reveal a striking answer to our central question: while state-of-the-art (SOTA) video models, such as Seedance 2.0, demonstrate competence on certain understanding and reasoning subtasks, they fall substantially short on logically grounded and interactive generation tasks (achieving success rates <25% and ~0%, respectively), exposing multimodal reasoning and physical grounding as critical bottlenecks. By systematically quantifying these limitations, the proposed method provides actionable feedback and a clear roadmap toward truly robust, general-purpose video models. CLVG-Bench and the accompanying code are publicly released.
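The diagnostics the abstract describes amount to success rates grouped by task category. The paper's actual evaluation pipeline is not reproduced here; the snippet below is only a minimal sketch of how such per-category rates could be tabulated from pass/fail verdicts, with the function name and category labels assumed for illustration.

```python
from collections import defaultdict


def per_category_success(judgments: list[tuple[str, bool]]) -> dict[str, float]:
    """Hypothetical aggregation of evaluator verdicts into per-category success rates.

    `judgments` pairs a category name with a pass/fail verdict; the output mirrors
    the kind of diagnostic the paper reports (e.g. <25% on logically grounded tasks,
    ~0% on interactive generation).
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in judgments:
        totals[category] += 1
        passes[category] += int(passed)
    return {cat: passes[cat] / totals[cat] for cat in totals}


# Illustrative verdicts only, mirroring the reported gap between task types.
print(per_category_success([
    ("logical reasoning", True), ("logical reasoning", False),
    ("interactive generation", False), ("interactive generation", False),
]))
```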
