INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs
arXiv cs.CV / 3/13/2026
Key Points
- INFACT introduces a diagnostic benchmark with 9,800 QA instances spanning real and synthetic videos to evaluate faithfulness and factuality in Video-LLMs.
- It evaluates models under four induced modes—Base, Visual Degradation, Evidence Corruption, and Temporal Intervention—and uses Resist Rate (RR) and Temporal Sensitivity Score (TSS) to quantify reliability.
- Experiments on 14 representative Video-LLMs show that higher Base-mode accuracy does not translate into robustness under induced modes: evidence corruption reduces stability, and temporal intervention causes the largest degradation.
- The results reveal pronounced temporal inertia among open-source baselines, with near-zero TSS on factuality for order-sensitive questions.
- By stressing models with real and synthetic videos and induced perturbations, INFACT highlights gaps between nominal accuracy and reliability in temporally sensitive settings.
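The Resist Rate (RR) metric above measures how well a model's answers hold up under perturbation. The paper's exact formula is not reproduced here, so the following is a minimal illustrative sketch under one plausible assumption: RR is the fraction of questions a model answers correctly in Base mode that remain correct under a given induced mode.

```python
# Hedged sketch of a "Resist Rate"-style metric. This is an illustrative
# assumption, NOT INFACT's published formula: here RR = (Base-mode-correct
# answers that stay correct under an induced mode) / (Base-mode-correct).

def resist_rate(base_correct, induced_correct):
    """base_correct, induced_correct: parallel lists of booleans,
    one entry per QA instance, for Base mode and one induced mode."""
    kept = sum(b and i for b, i in zip(base_correct, induced_correct))
    base_hits = sum(base_correct)
    return kept / base_hits if base_hits else 0.0

# Toy example: 4 of 5 Base answers are correct; 2 of those 4 survive
# an Evidence Corruption perturbation, giving RR = 0.5.
base = [True, True, True, True, False]
induced = [True, False, True, False, False]
print(round(resist_rate(base, induced), 2))  # -> 0.5
```

Under this reading, a model with high Base accuracy but a low RR is exactly the failure case the benchmark is designed to expose: nominally accurate, but unstable once its evidence is perturbed.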