From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering
arXiv cs.AI / 4/8/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper finds that Retrieval-Augmented Generation (RAG) performance is driven more by document preprocessing choices than by the specific PDF-to-Markdown conversion framework used.
- It systematically benchmarks four open-source PDF conversion approaches (Docling, MinerU, Marker, DeepSeek OCR) across 19 pipeline configurations on a 50-question benchmark from 36 Portuguese administrative documents, using LLM-as-judge scoring averaged over 10 runs.
- The best automated accuracy is achieved by Docling with hierarchical splitting and image descriptions (94.1%), outperforming the other conversion frameworks.
- Metadata enrichment and hierarchy-aware chunking improve QA accuracy more than conversion tool selection alone, and font-based hierarchy rebuilding consistently beats LLM-based hierarchy reconstruction.
- An exploratory GraphRAG setup underperforms basic RAG (82% accuracy), suggesting that naive knowledge-graph construction without strong ontological guidance adds complexity without benefit.
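The hierarchy-aware chunking the paper credits for most of the gains can be pictured as splitting the converted Markdown while carrying each chunk's heading path along as metadata, so the retriever sees section context as well as body text. The sketch below is a minimal, hypothetical illustration of that idea; the function name, splitting heuristic, and chunk-size threshold are assumptions, not the paper's actual implementation.

```python
def hierarchical_chunks(markdown: str, max_chars: int = 500):
    """Split Markdown into chunks, each tagged with its heading path.

    Hypothetical sketch: real pipelines (e.g. Docling's hierarchical
    splitter) are considerably more sophisticated.
    """
    path = []    # current heading stack, e.g. ["Decree 12", "Article 1"]
    buf = []     # lines accumulated for the chunk in progress
    chunks = []

    def flush():
        body = "\n".join(buf).strip()
        if body:
            # Metadata enrichment: prepend the heading path to each chunk.
            chunks.append({"headings": " > ".join(path), "text": body})
        buf.clear()

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]                  # pop deeper/equal headings
            path.append(line.lstrip("# ").strip())
        else:
            buf.append(line)
            if sum(len(l) for l in buf) > max_chars:
                flush()
    flush()
    return chunks


# Illustrative toy document, not drawn from the paper's benchmark.
doc = ("# Decree 12\n## Article 1\nFees are due in January.\n"
       "## Article 2\nExemptions apply to residents.")
for c in hierarchical_chunks(doc):
    print(c["headings"], "->", c["text"])
```

The key design choice is that heading context travels with every chunk, which is cheap to compute from font- or Markdown-level structure and, per the paper's findings, matters more than which conversion framework produced the Markdown.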