INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs
arXiv cs.CV / 3/13/2026
Key Points
- INFACT introduces a diagnostic benchmark with 9,800 QA instances spanning real and synthetic videos to evaluate faithfulness and factuality in Video-LLMs.
- It evaluates models under four induced modes—Base, Visual Degradation, Evidence Corruption, and Temporal Intervention—and uses Resist Rate (RR) and Temporal Sensitivity Score (TSS) to quantify reliability.
- Experiments on 14 representative Video-LLMs show that higher Base-mode accuracy is a poor predictor of robustness under the induced modes: evidence corruption reduces stability, and temporal intervention causes the largest degradation.
- The results reveal pronounced temporal inertia among open-source baselines, with near-zero TSS on factuality for order-sensitive questions.
- By stressing models with real and synthetic videos and induced perturbations, INFACT highlights gaps between nominal accuracy and reliability in temporally sensitive settings.
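The exact metric formulas are defined in the paper, not in this summary. A minimal sketch, assuming Resist Rate is the fraction of Base-mode-correct answers that remain correct under an induced perturbation, and Temporal Sensitivity Score is accuracy against the updated ground truth after the event order is reversed (both names from the paper; the formulas here are illustrative assumptions), might look like:

```python
# Hypothetical implementations of the two reliability metrics named above.
# Assumed definitions (not taken from the paper):
#   RR  = correct in Base AND correct under perturbation, over Base-correct.
#   TSS = fraction of order-sensitive questions answered correctly against
#         the NEW ground truth after temporal intervention; a temporally
#         inert model repeats its original answer and scores near zero.

def resist_rate(base_correct, induced_correct):
    """base_correct, induced_correct: per-question booleans."""
    kept = sum(b and i for b, i in zip(base_correct, induced_correct))
    base_hits = sum(base_correct)
    return kept / base_hits if base_hits else 0.0

def temporal_sensitivity(reversed_answers, reversed_gold):
    """Accuracy against the ground truth of the reordered video."""
    if not reversed_gold:
        return 0.0
    hits = sum(a == g for a, g in zip(reversed_answers, reversed_gold))
    return hits / len(reversed_gold)

# Toy example: the model keeps 2 of its 3 Base-mode-correct answers under
# perturbation, and (being temporally inert) repeats its original answers
# after the event order is reversed.
rr = resist_rate([True, True, False, True], [True, False, False, True])
tss = temporal_sensitivity(["A", "B"], ["B", "A"])
print(rr, tss)
```

In this toy run RR is 2/3 and TSS is 0.0, mirroring the near-zero factuality TSS the benchmark reports for order-sensitive questions.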
