CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs
arXiv cs.CV / 4/23/2026
Key Points
- CCTVBench is a new traffic VideoQA benchmark for multimodal LLMs. It tests “contrastive consistency” by pairing real accident videos with counterfactual counterparts, checking that models both detect true hazards and reject plausible but false hypotheses.
- The benchmark uses structured decision patterns over video-question quadruples and provides diagnostics that pinpoint specific failure modes, including positive omission, positive swap, negative hallucination, and mutual-exclusivity violations.
- Experiments show that models can score well on standard per-instance QA yet still show a large, persistent gap on quadruple-level contrastive consistency; poor “none-of-the-above” rejection is a major bottleneck.
- The paper proposes C-TCD, a contrastive decoding method that uses a semantically exclusive counterpart video during inference, improving both instance-level QA performance and contrastive consistency.
- CCTVBench separates video consistency from question consistency in evaluation, enabling more actionable analysis of where multimodal models fail during safety-critical reasoning.
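To make the gap between per-instance QA and quadruple-level consistency concrete, here is a minimal Python sketch. It is not the paper's scoring code. It assumes a quadruple crosses two videos (real and counterfactual) with two questions, and that a quadruple counts as consistent only when all four answers are correct. The `Quadruple` class, its field names, and the pairwise definitions of video and question consistency are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quadruple:
    """Correctness of a model's answers on one video-question quadruple.

    Assumed layout (hypothetical): two videos x two questions.
    """
    real_q1: bool
    real_q2: bool
    counter_q1: bool
    counter_q2: bool

    def cells(self):
        return (self.real_q1, self.real_q2, self.counter_q1, self.counter_q2)


def instance_accuracy(quads):
    # Standard per-instance QA: every (video, question) cell scored alone.
    cells = [c for q in quads for c in q.cells()]
    return sum(cells) / len(cells)


def quadruple_consistency(quads):
    # Assumed rule: a quadruple is consistent only if all four cells are correct.
    return sum(all(q.cells()) for q in quads) / len(quads)


def video_consistency(quads):
    # Same question, both videos answered correctly (assumed definition).
    pairs = [(q.real_q1, q.counter_q1) for q in quads]
    pairs += [(q.real_q2, q.counter_q2) for q in quads]
    return sum(a and b for a, b in pairs) / len(pairs)


def question_consistency(quads):
    # Same video, both questions answered correctly (assumed definition).
    pairs = [(q.real_q1, q.real_q2) for q in quads]
    pairs += [(q.counter_q1, q.counter_q2) for q in quads]
    return sum(a and b for a, b in pairs) / len(pairs)


# Made-up results for illustration only; not data from the paper.
quads = [
    Quadruple(True, True, True, False),
    Quadruple(True, True, True, True),
    Quadruple(True, False, True, True),
    Quadruple(False, True, True, True),
]

print(instance_accuracy(quads))      # 0.8125
print(quadruple_consistency(quads))  # 0.25
print(video_consistency(quads))      # 0.625
print(question_consistency(quads))   # 0.625
```

In this toy input, 13 of 16 individual answers are correct, yet only one quadruple in four is fully consistent. That is the kind of gap the benchmark is described as exposing, and scoring video pairs and question pairs separately shows which axis the errors fall on.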