DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
arXiv cs.AI / April 15, 2026
Key Points
- The paper argues that existing multimodal LLMs degrade on long-document understanding for two reasons: a low signal-to-noise ratio (key evidence buried among irrelevant pages) and weak supervision, since training data typically provides only final short answers.
- DocSeeker introduces a structured three-stage workflow (Analysis, Localization, and Reasoning) that forces the model to locate the relevant evidence before using it to produce an accurate answer; a sketch of this output format follows the list.
- It uses a two-stage training approach: supervised fine-tuning on high-quality distilled data, followed by an evidence-aware policy optimization that jointly improves evidence localization and answer accuracy (see the reward sketch after this list).
- To stay within memory limits on multi-page inputs, it proposes an Evidence-Guided Resolution Allocation strategy during training (a plausible sketch follows the list).
- Experiments report improved performance on both in-domain and out-of-domain tasks, robust generalization to ultra-long documents, and compatibility with visual retrieval-augmented generation systems.
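
The Analysis, Localization, and Reasoning stages can be pictured as a tagged response the model must emit plus a parser that extracts each stage. The prompt wording, tag names, and page-index format below are illustrative assumptions, not the paper's actual schema; a minimal sketch:

```python
import re

# Hypothetical structured prompt: the tag names and evidence format
# are assumptions, not DocSeeker's published output schema.
STRUCTURED_PROMPT = (
    "Answer the question about the document in three steps:\n"
    "<analysis>restate what the question asks for</analysis>\n"
    "<evidence>page indices that contain the answer, e.g. [3, 7]</evidence>\n"
    "<answer>the final short answer</answer>"
)

def parse_structured_output(text: str) -> dict:
    """Extract the three stages from a model response (assumed schema)."""
    def section(tag: str):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return m.group(1).strip() if m else None

    evidence = section("evidence")
    pages = [int(p) for p in re.findall(r"\d+", evidence)] if evidence else []
    return {
        "analysis": section("analysis"),
        "evidence_pages": pages,
        "answer": section("answer"),
    }
```

Forcing the evidence stage into a parseable slot is what makes the later policy-optimization stage possible, since the localization prediction can then be scored directly.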
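The evidence-aware policy optimization presumably folds a localization signal and an answer-correctness signal into one scalar reward. The summary does not say how these are computed or weighted; the page-level F1, exact-match scoring, and alpha weighting below are assumptions:

```python
def evidence_aware_reward(pred_pages, gold_pages, pred_answer, gold_answer,
                          alpha: float = 0.5) -> float:
    """Combined reward: page-localization F1 blended with answer accuracy.
    The F1 metric, exact-match check, and 0.5 weighting are illustrative
    assumptions, not the paper's stated reward."""
    pred, gold = set(pred_pages), set(gold_pages)
    if pred and gold:
        precision = len(pred & gold) / len(pred)
        recall = len(pred & gold) / len(gold)
        loc_f1 = (2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)
    else:
        loc_f1 = 0.0
    answer_score = float(
        pred_answer.strip().lower() == gold_answer.strip().lower()
    )
    return alpha * loc_f1 + (1 - alpha) * answer_score
```

Rewarding both terms jointly discourages a failure mode where the model guesses the right short answer while citing the wrong pages, which pure answer-only supervision cannot penalize.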
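Evidence-Guided Resolution Allocation is not detailed in this summary; one plausible reading is that pages flagged as evidence keep a high-resolution encoding while remaining pages are downsampled to fit a fixed token budget. The token counts, halving schedule, and budget logic below are all assumptions:

```python
def allocate_resolutions(num_pages, evidence_pages, token_budget,
                         hi_tokens: int = 1024, lo_tokens: int = 64) -> dict:
    """Assign per-page visual token budgets: evidence pages start high,
    others start coarse, and pages are halved until the budget fits.
    All constants here are illustrative assumptions."""
    evidence = set(evidence_pages)
    alloc = {p: (hi_tokens if p in evidence else lo_tokens)
             for p in range(num_pages)}
    while sum(alloc.values()) > token_budget:
        # Shrink the largest non-evidence page first; only touch
        # evidence pages once nothing else can be reduced.
        shrinkable = [p for p in alloc if p not in evidence and alloc[p] > 16]
        if not shrinkable:
            shrinkable = [p for p in alloc if alloc[p] > 16]
        if not shrinkable:
            break  # every page is already at the floor resolution
        target = max(shrinkable, key=lambda p: alloc[p])
        alloc[target] //= 2
    return alloc

# Example: a 10-page document where pages 2 and 7 hold the evidence.
print(allocate_resolutions(10, [2, 7], token_budget=1200))
```

Under this reading, the localization stage does double duty: it grounds the answer and tells the encoder where spending resolution actually pays off.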