VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning

arXiv cs.CV / 4/7/2026


Key Points

  • VidNum-1.4K provides 1,379 strictly human-annotated video-question pairs as a comprehensive VideoQA benchmark for testing video-based numerical reasoning (temporal events, object permanence, and compositional logic).
  • The benchmark is structured as a three-level hierarchy that raises difficulty step by step, from direct visual perception to "video-based compositional numerical reasoning" requiring arithmetic operations, comparisons, and logical deductions grounded in temporal evidence.
  • Evaluations of multiple SOTA VLMs reveal a "reasoning gap": Gemini-3.1-pro barely reaches about 60% accuracy, while representative open-source families lag far behind in the 25%–45% range.
  • The authors suggest that current VLMs may lack a stable "internal world model," positioning the benchmark as a demanding testbed for diagnosing the next generation of numerical video intelligence.

Abstract

Video-based numerical reasoning provides a premier arena for testing whether Vision-Language Models (VLMs) truly "understand" real-world dynamics, as accurate numerical deduction necessitates a profound grasp of temporal events, object permanence, and compositional logic beyond superficial pattern matching. However, existing benchmarks are often confined to narrow domains, such as repetitive athletic motions, or treat simple counting merely as a superficial regression task, failing to assess multi-step numerical logic within the inherent complexity of real-world multimedia content. We introduce VidNum-1.4K, a comprehensive VideoQA benchmark comprising 1,379 strictly human-annotated video-question pairs designed to evaluate genuine numerical reasoning across highly diverse environments, encompassing object, action, and event quantification. VidNum-1.4K is uniquely structured into a three-level hierarchy that evolves from direct visual perception to video-based compositional numerical reasoning, requiring models to perform arithmetic operations, comparisons, and logical deductions grounded in temporal evidence. Our evaluations across a diverse suite of state-of-the-art VLMs reveal a striking reasoning gap: while Gemini-3.1-pro barely reaches a 60% accuracy threshold, representative open-source families struggle heavily in the 25%–45% range. These findings demonstrate that current VLMs still lack a stable "internal world model", positioning VidNum-1.4K as a demanding diagnostic testbed for the next generation of numerical video intelligence.
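To make the reported accuracy figures concrete, the sketch below shows what a per-level exact-match evaluation loop over video-question pairs might look like. The item fields, level labels, answer normalization, and scoring rule are all assumptions for illustration, not the benchmark's actual data format or metric.

```python
# Hypothetical sketch of a VidNum-style evaluation loop.
# All field names and the scoring rule are assumptions, not the paper's spec.
from dataclasses import dataclass

@dataclass
class VideoQAItem:
    video_id: str
    question: str
    answer: str   # gold numerical answer, stored as a string
    level: int    # assumed: 1 = perception ... 3 = compositional reasoning

def normalize(ans: str) -> str:
    """Collapse trivial formatting differences ("7", " 7 ", "7.0") before comparison."""
    ans = ans.strip()
    try:
        num = float(ans)
        return str(int(num)) if num.is_integer() else str(num)
    except ValueError:
        return ans.lower()

def accuracy_by_level(items, predictions):
    """Exact-match accuracy per difficulty level; predictions maps video_id -> answer."""
    correct, total = {}, {}
    for item in items:
        total[item.level] = total.get(item.level, 0) + 1
        if normalize(predictions.get(item.video_id, "")) == normalize(item.answer):
            correct[item.level] = correct.get(item.level, 0) + 1
    return {lvl: correct.get(lvl, 0) / total[lvl] for lvl in total}

# Toy usage with made-up items and predictions
items = [
    VideoQAItem("v1", "How many people enter the room?", "3", 1),
    VideoQAItem("v2", "How many more cars than bikes pass by?", "2", 3),
]
preds = {"v1": "3.0", "v2": "5"}
print(accuracy_by_level(items, preds))  # {1: 1.0, 3: 0.0}
```

Breaking accuracy down by level, as the paper's hierarchy invites, is what would expose whether a model's errors concentrate in the compositional-reasoning tier rather than in basic perception.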