CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs

arXiv cs.CV · April 23, 2026

📰 News · Models & Research

Key Points

  • CCTVBench is a new traffic VideoQA benchmark for multimodal LLMs that tests “contrastive consistency” between real accident videos and counterfactual counterparts, ensuring models detect true hazards and reject plausible-but-false hypotheses.
  • The benchmark uses structured decision patterns over video-question quadruples and provides diagnostics that pinpoint specific failure modes, including positive omission, positive swap, negative hallucination, and mutual-exclusivity violations.
  • Experiments show that models can score well on standard per-instance QA, yet still have a large, persistent gap on quadruple-level contrastive consistency, with poor “none-of-the-above” rejection being a major bottleneck.
  • The paper proposes C-TCD, a contrastive decoding method that uses a semantically exclusive counterpart video during inference, improving both instance-level QA performance and contrastive consistency.
  • CCTVBench separates video consistency from question consistency in evaluation, enabling more actionable analysis of where multimodal models fail during safety-critical reasoning.
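The four failure modes listed above can be sketched as a simple diagnostic over one quadruple. This is a hypothetical reconstruction, not the paper's actual evaluation code: it assumes each quadruple pairs a real accident video and its counterfactual with two mutually exclusive hypothesis questions, and that the only correct pattern is affirming the true hazard on the real video while rejecting everything else.

```python
# Hypothetical sketch of quadruple-level failure-mode diagnosis.
# Assumed structure: one real video and one counterfactual video,
# each asked the true hazard hypothesis and a plausible distractor.
# Each answer is a boolean: did the model affirm the hypothesis?

from dataclasses import dataclass


@dataclass
class QuadrupleAnswers:
    real_true: bool        # real video, correct hazard hypothesis
    real_distractor: bool  # real video, plausible-but-false hypothesis
    cf_true: bool          # counterfactual video, same hazard hypothesis
    cf_distractor: bool    # counterfactual video, distractor hypothesis


def diagnose(a: QuadrupleAnswers) -> list[str]:
    """Return the failure modes exhibited by one quadruple."""
    failures = []
    if not a.real_true:
        failures.append("positive omission")       # missed the true hazard
    if a.real_distractor and not a.real_true:
        failures.append("positive swap")           # chose the wrong hypothesis
    if a.cf_true or a.cf_distractor:
        failures.append("negative hallucination")  # affirmed an absent hazard
    if a.real_true and a.real_distractor:
        failures.append("mutual-exclusivity violation")  # affirmed both exclusives
    return failures


def consistent(a: QuadrupleAnswers) -> bool:
    """A quadruple is contrastively consistent iff no failure mode fires."""
    return not diagnose(a)
```

Under this framing, standard per-instance accuracy credits each of the four answers independently, whereas quadruple-level consistency requires all four to be right at once, which is why the two metrics can diverge so sharply.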

Abstract

Safety-critical traffic reasoning requires contrastive consistency: models must detect true hazards when an accident occurs, and reliably reject plausible-but-false hypotheses under near-identical counterfactual scenes. We present CCTVBench, a Contrastive Consistency Traffic VideoQA Benchmark built on paired real accident videos and world-model-generated counterfactual counterparts, together with minimally different, mutually exclusive hypothesis questions. CCTVBench enforces a single structured decision pattern over each video–question quadruple and provides actionable diagnostics that decompose failures into positive omission, positive swap, negative hallucination, and mutual-exclusivity violation, while separating video-level from question-level consistency. Experiments across open-source and proprietary video LLMs reveal a large and persistent gap between standard per-instance QA metrics and quadruple-level contrastive consistency, with unreliable none-of-the-above rejection as a key bottleneck. Finally, we introduce C-TCD, a contrastive decoding approach that leverages a semantically exclusive counterpart video as the contrast input at inference time, improving both instance-level QA and contrastive consistency.
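The paper does not spell out C-TCD's exact formulation here, but contrastive decoding methods typically adjust next-token logits by subtracting logits computed under a contrast input. A minimal sketch under that assumption, with the counterpart video supplying the contrast branch and `alpha` as a hypothetical contrast-strength hyperparameter:

```python
# Hypothetical sketch of C-TCD-style contrastive decoding (assumed
# standard contrastive-decoding form, not the paper's exact method).
import numpy as np


def contrastive_decode_step(logits_main, logits_contrast, alpha=0.5):
    """One greedy decoding step with a contrast video.

    logits_main:     next-token logits conditioned on the query video
    logits_contrast: logits conditioned on the semantically exclusive
                     counterpart video (same question, same prefix)
    alpha:           contrast strength (hypothetical hyperparameter)

    Tokens scored highly under both videos (language priors, generic
    scene descriptions) are suppressed; tokens grounded in evidence
    unique to the query video are amplified.
    """
    adjusted = (1 + alpha) * logits_main - alpha * logits_contrast
    probs = np.exp(adjusted - adjusted.max())  # stable softmax
    probs /= probs.sum()
    return int(np.argmax(adjusted)), probs
```

For example, if a "yes, a collision occurs" token scores highly on both the accident video and its counterfactual, the subtraction pushes the model toward rejection on the counterfactual, which is exactly the none-of-the-above behavior the benchmark identifies as the bottleneck.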