DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams
arXiv cs.CL / 4/29/2026
Key Points
- DRAGON is a new benchmark for evaluating evidence-grounded visual reasoning in diagram question answering (DQA), addressing the gap that VLMs can answer accurately without grounding their answers in the relevant diagram regions.
- The benchmark requires models to output bounding boxes identifying the visual evidence (e.g., chart elements, labels, legends, axes, connectors) that justifies the predicted answer; a sketch of how such boxes might be scored follows the list.
- DRAGON includes 11,664 annotated question instances compiled from six existing diagram QA datasets, and it provides a 2,445-instance test set with human-verified evidence annotations.
- The authors evaluate eight recent vision-language models, analyzing how well they localize reasoning evidence across multiple diagram domains; tying answers to localized evidence makes the evaluation more reliable and interpretable.
- By standardizing evaluation and providing evidence localization targets, DRAGON aims to support future research focused on models that ground their predictions in visual proof rather than dataset artifacts.
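The summary does not say how DRAGON scores predicted evidence boxes against the human-verified annotations; a common choice for this kind of localization target is IoU-based matching. The sketch below is a minimal, hypothetical scorer: the (x1, y1, x2, y2) box format, the 0.5 IoU threshold, the greedy one-to-one matching, and the F1 aggregation are all assumptions, not details from the paper.

```python
"""Minimal sketch of evidence-localization scoring for a DRAGON-style
benchmark. Box format, IoU threshold, greedy matching, and F1 are
assumptions; the paper's actual metric may differ."""

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2, y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evidence_f1(pred: List[Box], gold: List[Box], thresh: float = 0.5) -> float:
    """Greedily match each predicted box to its best unmatched gold box;
    a pair counts as a hit if IoU >= thresh. Returns the F1 score."""
    if not pred or not gold:
        return 0.0
    matched_gold = set()
    hits = 0
    for p in pred:
        best_j, best_iou = -1, thresh
        for j, g in enumerate(gold):
            if j in matched_gold:
                continue
            score = iou(p, g)
            if score >= best_iou:
                best_j, best_iou = j, score
        if best_j >= 0:
            matched_gold.add(best_j)
            hits += 1
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall) if hits else 0.0


# Example: one of two gold evidence regions is recovered,
# so precision = 1.0, recall = 0.5, F1 ≈ 0.67.
pred = [(10, 10, 50, 40)]
gold = [(12, 12, 48, 42), (100, 100, 140, 130)]
print(evidence_f1(pred, gold))
```

Greedy matching keeps the sketch simple; a benchmark of this scale might instead use optimal (Hungarian) assignment or per-element accuracy, which the summary does not specify.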