Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA
arXiv cs.AI / April 27, 2026
Key Points
- The paper introduces MuDABench, a benchmark for multi-document analytical question answering over large, semi-structured document collections requiring quantitative analysis and cross-document synthesis.
- MuDABench is built with distant supervision from document metadata and annotated financial databases, yielding 332 analytical QA instances over a corpus of 80,000+ pages.
- The proposed evaluation protocol scores final-answer accuracy and also tracks intermediate-fact coverage, making it possible to diagnose where a system's reasoning breaks down.
- Experiments show that standard retrieval-augmented generation (RAG) approaches, which treat the collection as a flat retrieval pool, perform poorly on this task.
- The authors propose a multi-agent workflow (planning, fact extraction, and code generation) that improves results but still trails human experts; the main remaining bottlenecks are extraction errors and missing domain knowledge.
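The intermediate-fact coverage idea can be sketched concretely. Below is a minimal, illustrative metric assuming gold intermediate facts are given as normalized strings and matched leniently; the function names and matching rule are assumptions for illustration, not the paper's exact protocol.

```python
# Sketch of an intermediate-fact coverage metric. Assumes gold facts are
# short strings; the whitespace/case normalization here is an illustrative
# matching rule, not the benchmark's actual scorer.

def normalize(fact: str) -> str:
    """Lowercase and collapse whitespace for lenient string matching."""
    return " ".join(fact.lower().split())

def fact_coverage(gold_facts: list[str], predicted_facts: list[str]) -> float:
    """Fraction of gold intermediate facts recovered by the system."""
    if not gold_facts:
        return 1.0
    predicted = {normalize(f) for f in predicted_facts}
    hits = sum(1 for f in gold_facts if normalize(f) in predicted)
    return hits / len(gold_facts)
```

A system can then be scored on both axes: it may produce the right final number while recovering few gold facts (lucky guess) or recover most facts but compute the wrong answer (aggregation failure).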
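The plan → extract → compute pattern behind such a multi-agent workflow can be illustrated with a toy, deterministic version. In the paper each stage is an LLM agent; here every stage is a regex stand-in, and all names, documents, and the question format are assumptions for illustration only.

```python
# Toy sketch of a plan -> extract -> compute pipeline for analytical QA.
# Each stage below is a deterministic stand-in for what would be an LLM
# agent; entity names and document text are made up for the example.
import re

def plan(question: str, entities: list[str]) -> list[str]:
    """Planner: decompose a comparative question into per-entity lookups."""
    return [f"lookup revenue for {e}" for e in entities if e in question]

def extract(step: str, documents: dict[str, str]) -> float:
    """Extractor: pull the number a step asks for from that entity's filing."""
    entity = step.rsplit(" ", 1)[-1]
    match = re.search(r"revenue[^0-9]*([0-9.]+)", documents[entity], re.I)
    return float(match.group(1))

def compute(values: list[float]) -> float:
    """Stand-in for the code-generation stage: run the final arithmetic."""
    return values[0] / values[1]

docs = {
    "Acme": "Acme 10-K excerpt: total revenue 120.0 (USD millions)",
    "Globex": "Globex 10-K excerpt: total revenue 60.0 (USD millions)",
}
steps = plan("What is the revenue ratio of Acme to Globex?", ["Acme", "Globex"])
ratio = compute([extract(s, docs) for s in steps])
```

The design point is the division of labor: extracted values become explicit intermediate facts, so a scorer can check them independently of the final arithmetic, which is exactly where the paper locates the main bottleneck.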
