DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

arXiv cs.AI / 4/15/2026


Key Points

  • The paper argues that existing multimodal LLMs degrade on long-document understanding for two reasons: a low signal-to-noise ratio (key evidence buried in irrelevant pages) and weak supervision when training data provides only final short answers.
  • DocSeeker introduces a structured "Analysis, Localization, and Reasoning" workflow that compels the model to first locate relevant evidence and then ground its answers in it.
  • It uses a two-stage training recipe: supervised fine-tuning on high-quality distilled data, followed by an evidence-aware group relative policy optimization that jointly rewards evidence localization and answer accuracy (see the sketch after this list).
  • To address the memory limits of training on multi-page documents, it proposes an Evidence-Guided Resolution Allocation strategy.
  • Experiments report improved performance on both in-domain and out-of-domain tasks, robust generalization to ultra-long documents, and compatibility with visual retrieval-augmented generation systems.
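
To make the evidence-aware objective concrete, here is a minimal Python sketch of how a composite reward and GRPO-style group-relative advantages might be computed. The 50/50 weighting, the page-set F1 metric, and the exact-match answer check are illustrative assumptions; the paper's actual reward design may differ.

```python
# Minimal sketch (assumptions, not the paper's exact reward design):
# a composite reward that scores both where the model looked and what it answered.

def evidence_f1(pred_pages: set, gold_pages: set) -> float:
    """F1 overlap between cited pages and annotated evidence pages."""
    if not pred_pages or not gold_pages:
        return 0.0
    hits = len(pred_pages & gold_pages)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_pages)
    recall = hits / len(gold_pages)
    return 2 * precision * recall / (precision + recall)

def composite_reward(pred_pages, gold_pages, answer, gold_answer,
                     w_loc: float = 0.5, w_ans: float = 0.5) -> float:
    """Weighted mix of localization quality and exact-match answer accuracy."""
    r_loc = evidence_f1(set(pred_pages), set(gold_pages))
    r_ans = float(answer.strip().lower() == gold_answer.strip().lower())
    return w_loc * r_loc + w_ans * r_ans

def group_relative_advantages(rewards: list) -> list:
    """GRPO-style advantages: normalize rewards within a group of rollouts
    sampled for the same query."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: four rollouts for one query, each citing pages and giving an answer.
rollouts = [({3, 7}, "42"), ({3}, "42"), ({1, 2}, "41"), ({7}, "42")]
rewards = [composite_reward(p, {3, 7}, a, "42") for p, a in rollouts]
print(group_relative_advantages(rewards))
```

Because advantages are normalized within the group, a rollout that cites the right pages but answers wrongly (or vice versa) still receives a graded signal rather than the all-or-nothing feedback of answer-only supervision.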

Abstract

Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured "Analysis, Localization, and Reasoning" workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce an Evidence-Guided Resolution Allocation strategy to mitigate memory constraints of training on multi-page documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as a solid foundation for their implementation.
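
As an illustration of the resolution-allocation idea, the following Python sketch renders pages predicted to hold evidence at high resolution and everything else at low resolution under a fixed visual-token budget. The page scores, per-page token costs, and greedy upgrade policy are assumptions for exposition, not the paper's exact procedure.

```python
# Hedged sketch of evidence-guided resolution allocation: spend the visual-token
# budget on likely-evidence pages so a multi-page document fits in memory.
# hi_cost / lo_cost are hypothetical per-page token costs.

def allocate_resolutions(page_scores: list, token_budget: int,
                         hi_cost: int = 1024, lo_cost: int = 256) -> list:
    """Greedily upgrade the highest-scoring pages to high resolution."""
    n = len(page_scores)
    spent = n * lo_cost                      # every page starts at low res
    plan = ["low"] * n
    # Visit pages from most to least likely to contain evidence.
    for idx in sorted(range(n), key=lambda i: -page_scores[i]):
        upgrade = hi_cost - lo_cost
        if spent + upgrade > token_budget:
            break
        plan[idx] = "high"
        spent += upgrade
    return plan

# Example: 6 pages, budget covers roughly two high-res upgrades over the base cost.
print(allocate_resolutions([0.1, 0.9, 0.2, 0.8, 0.05, 0.3], token_budget=3200))
# -> ['low', 'high', 'low', 'high', 'low', 'low']
```

The design choice mirrors the paper's motivation: localization signals tell the trainer where full detail matters, so memory scales with the amount of evidence rather than with document length.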