DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

arXiv cs.CV / 4/8/2026

Key Points

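The paper proposes DetailVerifyBench, a benchmark for evaluating whether models can accurately identify and localize "dense" hallucinations in long image captions at the level of individual erroneous words or spans.
The benchmark consists of 1,000 high-quality images across five domains, with captions averaging over 200 words and detailed token-level annotations covering multiple hallucination types.
It aims to address the insufficient evaluation granularity and domain diversity of existing benchmarks, and the authors state that it allows localization accuracy on long captions to be measured more rigorously.
The benchmark is available on a public website and is positioned to support research on evaluating the reliability of MLLMs (Multimodal Large Language Models).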

Abstract

Accurately detecting and localizing hallucinations is a critical task for ensuring high reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often spanning hundreds of words. This shift exponentially increases the challenge: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flag response-level inconsistencies. However, existing benchmarks lack the fine granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it stands as the most challenging benchmark for precise hallucination localization in the field of long image captioning to date. Our benchmark is available at https://zyx-hhnkh.github.io/DetailVerifyBench/.
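
To make the idea of token-level localization concrete, the following is a minimal sketch of how a predicted set of hallucinated tokens could be scored against gold annotations. The abstract does not state which metrics DetailVerifyBench uses, so the precision/recall/F1 scoring, the function name `token_prf`, the whitespace tokenization, and the example caption and indices are all assumptions made for illustration only.

```python
from typing import Set, Tuple


def token_prf(gold: Set[int], pred: Set[int]) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 between two sets of token indices.

    `gold` holds indices annotated as hallucinated; `pred` holds indices a
    model flagged. This is an illustrative metric, not necessarily the one
    used by the benchmark.
    """
    tp = len(gold & pred)  # tokens correctly flagged as hallucinated
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical caption, split on whitespace (an assumed tokenization).
caption = "A red car is parked beside two bicycles near a fountain".split()

gold = {1, 6}  # hypothetical gold annotation: "red", "two"
pred = {1, 7}  # hypothetical model output: "red", "bicycles"

precision, recall, f1 = token_prf(gold, pred)
print([caption[i] for i in sorted(gold & pred)], precision, recall, f1)
```

A response-level check would only report that this caption contains an error; scoring at the token level also penalizes flagging the wrong word ("bicycles") and missing a wrong one ("two"), which is the distinction the abstract draws.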
