CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs

arXiv cs.CV · April 23, 2026

📰 News · Models & Research

Key Points

  • CCTVBench is a new traffic VideoQA benchmark for multimodal LLMs that tests “contrastive consistency” between real accident videos and counterfactual counterparts, ensuring models detect true hazards and reject plausible-but-false hypotheses.
  • The benchmark uses structured decision patterns over video-question quadruples and provides diagnostics that pinpoint specific failure modes, including positive omission, positive swap, negative hallucination, and mutual-exclusivity violations.
  • Experiments show that models can score well on standard per-instance QA, yet still have a large, persistent gap on quadruple-level contrastive consistency, with poor “none-of-the-above” rejection being a major bottleneck.
  • The paper proposes C-TCD, a contrastive decoding method that uses a semantically exclusive counterpart video during inference, improving both instance-level QA performance and contrastive consistency.
  • CCTVBench separates video consistency from question consistency in evaluation, enabling more actionable analysis of where multimodal models fail during safety-critical reasoning.
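The four failure modes listed above can be sketched as a simple diagnostic over one quadruple. This is a hypothetical reconstruction, not the paper's actual evaluation code: it assumes each quadruple pairs a real accident video and its counterfactual with two mutually exclusive hypothesis questions, and that the only correct pattern is affirming the true hazard on the real video while rejecting everything else.

```python
# Hypothetical sketch of quadruple-level failure-mode diagnosis.
# Assumed structure: one real video and one counterfactual video,
# each asked the true hazard hypothesis and a plausible distractor.
# Each answer is a boolean: did the model affirm the hypothesis?

from dataclasses import dataclass


@dataclass
class QuadrupleAnswers:
    real_true: bool        # real video, correct hazard hypothesis
    real_distractor: bool  # real video, plausible-but-false hypothesis
    cf_true: bool          # counterfactual video, same hazard hypothesis
    cf_distractor: bool    # counterfactual video, distractor hypothesis


def diagnose(a: QuadrupleAnswers) -> list[str]:
    """Return the failure modes exhibited by one quadruple."""
    failures = []
    if not a.real_true:
        failures.append("positive omission")       # missed the true hazard
    if a.real_distractor and not a.real_true:
        failures.append("positive swap")           # chose the wrong hypothesis
    if a.cf_true or a.cf_distractor:
        failures.append("negative hallucination")  # affirmed an absent hazard
    if a.real_true and a.real_distractor:
        failures.append("mutual-exclusivity violation")  # affirmed both exclusives
    return failures


def consistent(a: QuadrupleAnswers) -> bool:
    """A quadruple is contrastively consistent iff no failure mode fires."""
    return not diagnose(a)
```

Under this framing, standard per-instance accuracy credits each of the four answers independently, whereas quadruple-level consistency requires all four to be right at once, which is why the two metrics can diverge so sharply.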

Abstract

Safety-critical traffic reasoning requires contrastive consistency: models must detect true hazards when an accident occurs, and reliably reject plausible-but-false hypotheses under near-identical counterfactual scenes. We present CCTVBench, a Contrastive Consistency Traffic VideoQA Benchmark built on paired real accident videos and world-model-generated counterfactual counterparts, together with minimally different, mutually exclusive hypothesis questions. CCTVBench enforces a single structured decision pattern over each video–question quadruple and provides actionable diagnostics that decompose failures into positive omission, positive swap, negative hallucination, and mutual-exclusivity violation, while separating video-level from question-level consistency. Experiments across open-source and proprietary video LLMs reveal a large and persistent gap between standard per-instance QA metrics and quadruple-level contrastive consistency, with unreliable none-of-the-above rejection as a key bottleneck. Finally, we introduce C-TCD, a contrastive decoding approach that leverages a semantically exclusive counterpart video as the contrast input at inference time, improving both instance-level QA and contrastive consistency.
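The paper does not spell out C-TCD's exact formulation here, but contrastive decoding methods typically adjust next-token logits by subtracting logits computed under a contrast input. A minimal sketch under that assumption, with the counterpart video supplying the contrast branch and `alpha` as a hypothetical contrast-strength hyperparameter:

```python
# Hypothetical sketch of C-TCD-style contrastive decoding (assumed
# standard contrastive-decoding form, not the paper's exact method).
import numpy as np


def contrastive_decode_step(logits_main, logits_contrast, alpha=0.5):
    """One greedy decoding step with a contrast video.

    logits_main:     next-token logits conditioned on the query video
    logits_contrast: logits conditioned on the semantically exclusive
                     counterpart video (same question, same prefix)
    alpha:           contrast strength (hypothetical hyperparameter)

    Tokens scored highly under both videos (language priors, generic
    scene descriptions) are suppressed; tokens grounded in evidence
    unique to the query video are amplified.
    """
    adjusted = (1 + alpha) * logits_main - alpha * logits_contrast
    probs = np.exp(adjusted - adjusted.max())  # stable softmax
    probs /= probs.sum()
    return int(np.argmax(adjusted)), probs
```

For example, if a "yes, a collision occurs" token scores highly on both the accident video and its counterfactual, the subtraction pushes the model toward rejection on the counterfactual, which is exactly the none-of-the-above behavior the benchmark identifies as the bottleneck.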