FCMBench-Video: Benchmarking Document Video Intelligence

arXiv cs.CV / April 29, 2026


Key Points

  • The paper introduces FCMBench-Video, a new benchmark focused on document video intelligence for financial use cases where accuracy and evidence traceability are critical (e.g., credit review and remote verification).
  • Unlike static images, document videos add temporal, sequential evidence that must be integrated across frames while retaining authenticity-relevant acquisition cues.
  • The benchmark is constructed for privacy-compliant but realistic scaling by recording reusable atomic single-document clips, applying controlled degradations, and composing long-form multi-document videos with specified temporal spans.
  • FCMBench-Video includes 495 atomic videos that are composed into 1,200 long-form videos, with 11,322 expert-annotated QA instances across 28 document types and both Chinese and English questions.
  • Tests on nine recent Video-MLLMs suggest the benchmark meaningfully differentiates systems and capabilities, identifying which tasks are most duration-sensitive and which probe higher-level evidence integration and robustness (e.g., visual prompt injection).
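The construction workflow in the key points (record reusable atomic single-document clips, apply controlled degradations, compose long-form multi-document videos to a prescribed temporal span) can be sketched abstractly. Everything below is a hypothetical illustration, not the paper's implementation: the names (`AtomicClip`, `degrade`, `compose_video`), the degradation tags, and the greedy composition strategy are all assumptions.

```python
import random
from dataclasses import dataclass, field

@dataclass
class AtomicClip:
    doc_type: str          # e.g. "id_card", "bank_statement" (illustrative labels)
    duration: float        # clip length in seconds
    degradations: list = field(default_factory=list)

def degrade(clip: AtomicClip, tags: list) -> AtomicClip:
    # Controlled degradations (blur, glare, ...) are modeled here as metadata
    # tags rather than actual pixel-level transforms.
    return AtomicClip(clip.doc_type, clip.duration, clip.degradations + list(tags))

def compose_video(pool: list, target_span: float, rng: random.Random):
    """Greedily draw reusable atomic clips until the prescribed span is met."""
    video, total = [], 0.0
    while total < target_span:
        clip = rng.choice(pool)
        video.append(clip)
        total += clip.duration
    return video, total

rng = random.Random(0)
pool = [AtomicClip("id_card", 4.0),
        AtomicClip("bank_statement", 6.5),
        AtomicClip("contract", 5.0)]
pool = [degrade(c, ["motion_blur"]) for c in pool]

# Compose one long-form video at the 20-second duration tier.
video, span = compose_video(pool, target_span=20.0, rng=rng)
print(f"{len(video)} clips, {span:.1f}s total")
```

Because clips are reusable atoms, the same pool can be recombined under different duration tiers (20s–60s in the benchmark) without recapturing any private documents, which is the point of the privacy-compliant scaling design.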

Abstract

Document understanding is a critical capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, document videos present a temporally redundant and sequentially unfolding evidence stream, require evidence integration across frames, and preserve acquisition-process cues relevant to authenticity-sensitive and anti-fraud review. We introduce FCMBench-Video, a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. For privacy-compliant yet realistic data at scale, we organize construction as an atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with prescribed temporal spans. FCMBench-Video is built from 495 atomic videos composed into 1,200 long-form videos paired with 11,322 expert-annotated question–answer instances, covering 28 document types over 20s–60s duration tiers and 5,960 Chinese / 5,362 English instances. Evaluations on nine recent Video-MLLMs show that FCMBench-Video provides meaningful separation across systems and capabilities: counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection provides a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, indicating a benchmark that is neither saturated nor dominated by trivial cases. Together, these results position FCMBench-Video as a reproducible benchmark for tracking Video-MLLM progress on document-video understanding and probing capability boundaries in authenticity-sensitive credit-domain applications.