Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation
arXiv cs.CV / 5/5/2026
Key Points
- The paper introduces Chain of Evidence (CoE), a retriever-agnostic framework for pixel-level visual attribution in Iterative Retrieval-Augmented Generation (iRAG).
- CoE addresses two limitations of current iRAG systems: coarse-grained text-level citation that forces users to manually search documents, and visual semantic loss caused by converting layouts (slides/PDFs) into plain text.
- Instead of relying on parsed text, CoE uses Vision-Language Models to reason directly over screenshots of retrieved document candidates and outputs precise bounding boxes.
- The authors evaluate CoE on two benchmarks, Wiki-CoE (web pages derived from 2WikiMultiHopQA) and SlideVQA (presentation slides with complex diagrams and layouts).
- A fine-tuned Qwen3-VL-8B-Instruct delivers strong results, outperforming text-based baselines on tasks that require visual layout understanding; the code is released on GitHub.
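To make the "precise bounding boxes" output concrete: a CoE-style pipeline ultimately has to map the VLM's predicted evidence regions back onto the original screenshot so a user can see exactly where the answer came from. The paper's exact output schema is not given here, so the sketch below assumes a hypothetical JSON response format with normalized `[x0, y0, x1, y1]` box coordinates and converts them to pixel coordinates:

```python
import json

def parse_evidence_boxes(vlm_response: str, page_width: int, page_height: int):
    """Convert normalized [0, 1] bounding boxes from a (hypothetical)
    VLM JSON answer into pixel coordinates on the source screenshot."""
    data = json.loads(vlm_response)
    boxes = []
    for item in data.get("evidence", []):
        x0, y0, x1, y1 = item["bbox"]  # normalized corners
        boxes.append({
            "page": item["page"],
            "bbox_px": (
                round(x0 * page_width), round(y0 * page_height),
                round(x1 * page_width), round(y1 * page_height),
            ),
        })
    return boxes

# Mock response for a 1280x720 slide screenshot (illustrative only)
mock = '{"evidence": [{"page": 2, "bbox": [0.10, 0.25, 0.55, 0.40]}]}'
print(parse_evidence_boxes(mock, 1280, 720))
# → [{'page': 2, 'bbox_px': (128, 180, 704, 288)}]
```

The pixel boxes could then be drawn as highlights on the retrieved screenshot, which is what makes this attribution "pixel-level" rather than a text-span citation the user must hunt for manually.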