A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows
arXiv cs.CV / 4/30/2026
Key Points
- The study addresses structured information extraction from long, noisy, multilingual scanned financial documents in real-world industrial Know Your Customer (KYC) and compliance workflows, where end-to-end vision-language models (VLMs) can be unreliable.
- It proposes a multistage pipeline that combines image preprocessing, multilingual OCR, hybrid page-level retrieval, and compact VLM-based structured extraction, explicitly separating page localization from multimodal reasoning.
- Experiments on 120 production KYC documents (about 3,000 pages) show the pipeline beats direct PDF-to-VLM baselines across multiple OCR–VLM combinations, with up to a 31.9 percentage-point improvement in field-level accuracy.
- The best-performing setup uses PaddleOCR with MiniCPM2.6, reaching 87.27% accuracy, and ablations indicate that page-level retrieval is the main driver of gains, especially for complex and non-English statements.
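
To make the page-localization step concrete, here is a minimal sketch of hybrid page-level retrieval over per-page OCR text. It is not the paper's implementation: the summary does not say how the two retrieval signals are defined or combined, so this sketch assumes a weighted blend of an exact keyword-overlap score and a character-trigram cosine similarity (a stdlib stand-in for a dense embedding model). The function names, the `alpha` weight, the `top_k` cutoff, and the sample pages are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; \w is Unicode-aware, so non-English text works too.
    return re.findall(r"\w+", text.lower())


def keyword_score(query, page_text):
    """Fraction of distinct query tokens that appear verbatim in the page."""
    query_tokens = set(tokenize(query))
    if not query_tokens:
        return 0.0
    return len(query_tokens & set(tokenize(page_text))) / len(query_tokens)


def trigram_vector(text):
    # Character trigrams tolerate small OCR errors better than whole words.
    # Assumption: this replaces the embedding model a real system would use.
    normalized = " ".join(tokenize(text))
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def cosine(a, b):
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_pages(query, pages, alpha=0.5, top_k=2):
    """Return the indices of the top_k pages by blended score (hypothetical weighting)."""
    query_vec = trigram_vector(query)
    scored = []
    for idx, text in enumerate(pages):
        score = (alpha * keyword_score(query, text)
                 + (1 - alpha) * cosine(query_vec, trigram_vector(text)))
        scored.append((score, idx))
    # Highest score first; ties broken by page order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored[:top_k]]


# Made-up OCR output for a three-page document.
pages = [
    "Terms and conditions. Fees may apply to overdrafts.",
    "Account holder: Jane Doe. Statement period 01/01/2025 to 31/01/2025. Closing balance 1,250.00",
    "This page intentionally left blank",
]
selected = rank_pages("account holder name closing balance", pages, top_k=1)
print(selected)
```

Only the selected pages would then be passed to the compact VLM for structured extraction, which is the separation of page localization from multimodal reasoning that the summary describes.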