DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams

arXiv cs.CL / 4/29/2026


Key Points

  • DRAGON is a new benchmark designed to evaluate evidence-grounded visual reasoning in diagram question answering (DQA), addressing the problem that VLMs can answer correctly without grounding their answers in the relevant diagram regions.
  • The benchmark requires models to output bounding boxes identifying the visual evidence (e.g., chart elements, labels, legends, axes, connectors) that justify the predicted answer.
  • DRAGON includes 11,664 annotated question instances compiled from six existing diagram QA datasets, and it provides a 2,445-instance test set with human-verified evidence annotations.
  • The authors assess eight recent vision-language models and analyze how well they localize reasoning evidence across multiple diagram domains, supporting more reliable and interpretable evaluation of diagram reasoning.
  • By standardizing evaluation and providing evidence localization targets, DRAGON aims to support future research focused on models that ground their predictions in visual proof rather than dataset artifacts.

Abstract

Diagram question answering (DQA) requires models to interpret structured visual representations such as charts, maps, infographics, circuit schematics, and scientific diagrams. Recent vision-language models (VLMs) often achieve high answer accuracy on these tasks, yet correct answers do not guarantee that models ground their reasoning in the diagram regions that support the prediction. Models may instead rely on textual correlations or dataset artifacts without identifying the visual evidence required to verify the answer. This limitation prevents reliable evaluation of diagram reasoning and reduces interpretability. We introduce DRAGON, a benchmark for evaluating evidence-grounded visual reasoning in diagrams. Given a diagram, a question, and the correct answer, a model must predict bounding boxes that correspond to the visual elements required to justify the answer. These evidence regions may include answer-bearing components, textual labels, legends, axes, connectors, and other supporting structures involved in the reasoning process. The DRAGON dataset contains 11,664 annotated question instances collected from six diagram QA datasets: ChartQA, Circuit-VQA, InfographicsVQA, MapIQ, MapWise, and AI2D. We release a 2,445-instance benchmark test set with human-verified reasoning evidence annotations and a standardized evaluation framework. We evaluate eight recent VLMs and analyze their ability to localize reasoning evidence across diverse diagram domains. DRAGON enables systematic evaluation of diagram reasoning and supports future research on models that ground their predictions in visual evidence.
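To make the task format concrete, the sketch below shows one way an evidence-localization instance could be represented and scored with IoU-based precision/recall/F1 over predicted versus gold bounding boxes. This is a minimal illustration only: the field names, the greedy matching scheme, and the 0.5 IoU threshold are assumptions, and the summary above does not specify the metric actually used in DRAGON's standardized evaluation framework.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class EvidenceInstance:
    """One DRAGON-style instance: diagram, question, answer, gold evidence boxes.
    Field names are illustrative; the released dataset schema may differ."""
    image_path: str
    question: str
    answer: str
    gold_boxes: List[Box]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evidence_prf(pred_boxes: List[Box], gold_boxes: List[Box], thresh: float = 0.5):
    """Precision/recall/F1 for evidence localization: a predicted box counts as a
    true positive if it greedily matches an unmatched gold box with IoU >= thresh."""
    matched_gold = set()
    tp = 0
    for p in pred_boxes:
        best_iou, best_j = 0.0, None
        for j, g in enumerate(gold_boxes):
            if j in matched_gold:
                continue
            score = iou(p, g)
            if score > best_iou:
                best_iou, best_j = score, j
        if best_j is not None and best_iou >= thresh:
            matched_gold.add(best_j)
            tp += 1
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gold_boxes) if gold_boxes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Example: the first predicted box overlaps a gold legend region well; the second misses.
gold = [(100, 40, 220, 80), (300, 200, 360, 240)]
pred = [(105, 45, 225, 85), (10, 10, 50, 50)]
print(evidence_prf(pred, gold))  # -> (0.5, 0.5, 0.5)
```

In practice, a per-instance score like this would be averaged over the test set and reported per source dataset (e.g., ChartQA vs. AI2D) to compare how well each VLM localizes evidence across diagram domains.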