Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

arXiv cs.CL / 4/16/2026


Key Points

  • The paper introduces Doc-V*, an OCR-free, agentic framework for multi-page Document Visual Question Answering that performs sequential evidence aggregation rather than passive retrieval.
  • Doc-V* starts from a thumbnail overview, then uses semantic retrieval and targeted page fetching to actively navigate documents and gather only the most relevant pages.
  • The method maintains structured working memory to aggregate grounded evidence for reasoning, aiming to improve accuracy without scaling costs proportional to document length.
  • Training uses imitation learning from expert trajectories, followed by optimization with Group Relative Policy Optimization to balance answer quality with evidence-seeking efficiency.
  • Experiments on five benchmarks show Doc-V* beating open-source baselines and improving out-of-domain performance by up to 47.9% over a RAG baseline, while additional analyses indicate gains come from better evidence aggregation rather than simply using more pages.
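The coarse-to-fine loop summarized in these bullets can be sketched roughly as follows. Every name here (`retrieve`, `read_page`, `reason`, and the toy stand-ins) is a hypothetical illustration of the described flow, not the paper's actual interface:

```python
def answer_question(question, pages, retrieve, read_page, reason, max_fetches=3):
    """Coarse-to-fine sketch: overview -> ranked fetches -> grounded answer."""
    memory = []  # structured working memory of gathered evidence
    # Step 1: coarse pass — a low-resolution thumbnail of every page.
    overview = [("thumbnail", i) for i in range(len(pages))]
    # Step 2: semantic retrieval ranks pages by relevance to the question.
    ranked = retrieve(question, pages)
    # Step 3: fetch only the top-ranked pages at full resolution.
    for page_idx in ranked[:max_fetches]:
        evidence = read_page(pages[page_idx])
        memory.append({"page": page_idx, "evidence": evidence})
    # Step 4: reason over the aggregated memory, not the whole document.
    return reason(question, overview, memory)


# Toy stand-ins so the loop runs end to end:
pages = ["intro text", "table: revenue 2023 = $5M", "appendix"]

def toy_retrieve(q, pgs):
    # Rank pages by naive keyword overlap with the question.
    scores = [sum(w in p for w in q.lower().split()) for p in pgs]
    return sorted(range(len(pgs)), key=lambda i: -scores[i])

def toy_read(page):
    return page  # in the real system, a VLM would read the page image

def toy_reason(q, overview, memory):
    return memory[0]["evidence"] if memory else "unknown"

print(answer_question("what was revenue in 2023", pages,
                      toy_retrieve, toy_read, toy_reason))
# → table: revenue 2023 = $5M
```

The point of the sketch is the cost profile: only `max_fetches` pages are ever read at full resolution, so inference cost does not grow in proportion to document length.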

Abstract

Multi-page Document Visual Question Answering (DocVQA) requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-V*, an **OCR-free agentic** framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-V* begins with a thumbnail overview, then actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in a structured working memory for grounded reasoning. Trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization, Doc-V* balances answer accuracy with evidence-seeking efficiency. Across five benchmarks, Doc-V* outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to **47.9%** over a RAG baseline. Additional analyses show that the gains come from effective evidence aggregation with selective attention, not from increased input pages.
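The abstract describes a reward that trades answer accuracy against evidence-seeking efficiency, optimized with Group Relative Policy Optimization (GRPO). A minimal sketch of what such a group-relative objective could look like, assuming a simple linear penalty on fetched pages (the reward form and the weight are assumptions, not taken from the paper):

```python
def trajectory_reward(answer_correct, pages_fetched, efficiency_weight=0.1):
    """Assumed reward: 1 for a correct answer, minus a per-fetch penalty."""
    accuracy_term = 1.0 if answer_correct else 0.0
    return accuracy_term - efficiency_weight * pages_fetched

def grpo_advantages(rewards):
    """GRPO's core idea: normalize each sampled trajectory's reward against
    the mean and standard deviation of its own sampled group, so no separate
    value network is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Three sampled trajectories for the same question: correct-and-cheap beats
# correct-but-wasteful, and both beat incorrect.
group = [trajectory_reward(True, 2),   # correct, 2 fetches
         trajectory_reward(False, 5),  # wrong, 5 fetches
         trajectory_reward(True, 4)]   # correct, 4 fetches
print(grpo_advantages(group))
```

Under this shaping, the policy is pushed toward trajectories that answer correctly with few fetches, which matches the paper's stated goal of accuracy without cost scaling with document length.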