Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

arXiv cs.CL / 4/15/2026



Key Points

The paper studies how specific design choices—particularly PDF parsing methods and text/table chunking strategies—affect Retrieval-Augmented Generation (RAG) performance on financial document Question Answering.
It evaluates multiple PDF parsers and chunking approaches with different overlap settings to understand trade-offs in preserving document structure and improving answer correctness.
The study is grounded in two financial-domain benchmarks, one of which is TableQuest, a newly generated, publicly available benchmark focused on tabular PDF understanding.
The authors aim to provide practical, evidence-based guidelines for building more robust RAG pipelines tailored to heterogeneous PDF content (text, tables, and images).
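
To make the idea of chunking with overlap concrete, here is a minimal sketch of a sliding-window chunker. It is not the paper's implementation: the function name `chunk_words`, the word-based splitting, and the default sizes are all assumptions chosen for illustration, and real pipelines often count tokens or characters instead and treat tables separately from running text.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words. Hypothetical helper for
    illustration only; not taken from the paper.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text, so the
        # final chunk is not followed by a redundant shorter one.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = "Revenue rose in the third quarter while operating costs fell slightly"
    for piece in chunk_words(sample, chunk_size=5, overlap=2):
        print(piece)
```

A larger overlap makes it more likely that a fact split across a chunk boundary appears whole in at least one chunk, at the cost of more chunks to index and retrieve. That is the kind of trade-off the overlap settings in the study are meant to probe.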

Abstract

PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, both practitioners and researchers are increasingly developing new methods for automated PDF processing, including promising Retrieval-Augmented Generation (RAG) systems. However, there is no comprehensive study investigating how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we present such a study (1) by focusing on Question Answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF parsers and chunking strategies (with varied overlap), along with their potential synergies in preserving document structure and ensuring answer correctness. Overall, our results offer practical guidelines for building robust RAG pipelines for PDF understanding.