Lightweight and Production-Ready PDF Visual Element Parsing

arXiv cs.CV / April 28, 2026


Key Points

  • The paper introduces a lightweight, production-ready framework to parse PDF visual elements (figures, tables, and forms) with reliable caption-to-element association for better document understanding.
  • It addresses limitations of existing PDF parsers, including missing complex visuals, extracting irrelevant artifacts like watermarks/logos, fragmenting elements, and failing to link captions correctly.
  • Using spatial heuristics, layout analysis, and semantic similarity, the system reports at least 96% visual element detection accuracy and 93% caption association accuracy on benchmarks and internal product data.
  • As a preprocessing step for multimodal RAG, the framework outperforms prior state-of-the-art parsers and large vision-language models on internal data and the MMDocRAG benchmark, while cutting latency by more than 2x.
  • The authors state the system has already been deployed in a challenging production environment, emphasizing practical readiness.
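The paper does not publish its caption-association algorithm in detail, but the combination it names (spatial heuristics plus semantic similarity) can be sketched as a simple scoring function. The sketch below is illustrative only: the `Box` type, the overlap/gap heuristic, the Jaccard stand-in for semantic similarity, and the `alpha` blending weight are all assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x0: float  # left
    y0: float  # top (y grows downward, as in most PDF renderers)
    x1: float  # right
    y1: float  # bottom

def spatial_score(elem: Box, cap: Box, page_height: float) -> float:
    """Score how plausibly `cap` labels `elem`: horizontal overlap
    weighted by vertical proximity (captions sit just above or below)."""
    gap = min(abs(cap.y0 - elem.y1), abs(elem.y0 - cap.y1))
    overlap = max(0.0, min(elem.x1, cap.x1) - max(elem.x0, cap.x0))
    narrower = max(1e-6, min(elem.x1 - elem.x0, cap.x1 - cap.x0))
    return (overlap / narrower) * max(0.0, 1.0 - gap / page_height)

def jaccard(a: str, b: str) -> float:
    """Crude lexical stand-in for the paper's semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(1, len(sa | sb))

def associate(elem: Box, elem_text: str, captions, page_height: float,
              alpha: float = 0.7):
    """Pick the (box, text) caption maximizing a weighted blend of
    spatial plausibility and text similarity. `alpha` is a guess."""
    return max(
        captions,
        key=lambda c: alpha * spatial_score(elem, c[0], page_height)
                      + (1 - alpha) * jaccard(elem_text, c[1]),
    )
```

In practice a real system would use layout-analysis cues (reading order, column detection) and embedding-based similarity rather than word overlap, but the scoring-and-argmax structure would be similar.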

Abstract

PDF documents contain critical visual elements such as figures, tables, and forms whose accurate extraction is essential for document understanding and multimodal retrieval-augmented generation (RAG). Existing PDF parsers often miss complex visuals, extract non-informative artifacts (e.g., watermarks, logos), produce fragmented elements, and fail to reliably associate captions with their corresponding elements, which degrades downstream retrieval and question answering. We present a lightweight and production-level PDF parsing framework that can accurately detect visual elements and associate captions using a combination of spatial heuristics, layout analysis, and semantic similarity. On popular benchmark datasets and internal product data, the proposed solution achieves ≥96% visual element detection accuracy and 93% caption association accuracy. When used as a preprocessing step for multimodal RAG, it significantly outperforms state-of-the-art parsers and large vision-language models on both internal data and the MMDocRAG benchmark, while reducing latency by over 2×. We have deployed the proposed system in a challenging production environment.