Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG
arXiv cs.CL / 4/15/2026
Key Points
- The paper studies how specific design choices—particularly PDF parsing methods and text/table chunking strategies—affect Retrieval-Augmented Generation (RAG) performance on financial document Question Answering.
- It evaluates multiple PDF parsers and chunking approaches with different overlap settings to understand trade-offs in preserving document structure and improving answer correctness.
- The study is grounded in financial-domain benchmarks, including TableQuest, a newly generated, publicly available benchmark focused on tabular PDF understanding.
- The authors aim to provide practical, evidence-based guidelines for building more robust RAG pipelines tailored to heterogeneous PDF content (text, tables, and images).
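One of the design choices the paper evaluates is text chunking with different overlap settings. As a rough illustration (not the authors' exact implementation, and the function name and parameters here are hypothetical), fixed-size chunking with overlap can be sketched as:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks, each sharing `overlap`
    characters with the previous chunk so context is preserved
    across chunk boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final chunk already reaches the end of the text
        start += step
    return chunks
```

Larger overlap reduces the risk of splitting a relevant passage (or table row) across chunks, at the cost of more redundant text in the retrieval index — exactly the trade-off the paper measures.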




