Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA

arXiv cs.AI / April 27, 2026


Key Points

  • The paper introduces MuDABench, a benchmark for multi-document analytical question answering over large, semi-structured document collections, where answering requires quantitative analysis and cross-document synthesis.
  • MuDABench is built with distant supervision using document metadata and annotated financial databases, resulting in 80,000+ pages and 332 analytical QA instances.
  • The proposed evaluation protocol scores final-answer accuracy and also tracks intermediate-fact coverage as an auxiliary diagnostic for reasoning quality (a minimal sketch of this scoring follows the list).
  • Experiments show that standard RAG approaches that treat documents as a flat retrieval pool perform poorly on this task.
  • The authors propose a multi-agent workflow (planning, extraction, and code generation) that improves outcomes but still lags behind human experts, with single-document extraction accuracy and gaps in domain knowledge as the key bottlenecks; a hypothetical sketch of the workflow appears after the abstract.
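
To make the process metric concrete, here is a minimal sketch of how intermediate-fact coverage could be scored alongside final-answer accuracy. The paper's exact matching and normalization rules are not reproduced here; this version assumes gold facts are provided as normalized strings and compares them by set overlap.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    answer_correct: bool  # outcome metric: does the final answer match gold?
    fact_coverage: float  # process metric: fraction of gold facts recovered


def evaluate_instance(pred_answer: str, pred_facts: set[str],
                      gold_answer: str, gold_facts: set[str]) -> EvalResult:
    # Outcome: exact match after light normalization (an assumption here;
    # numeric answers would likely need tolerance-based comparison).
    correct = pred_answer.strip().lower() == gold_answer.strip().lower()
    # Process: how many of the annotated intermediate facts the system
    # surfaced on its way to the answer.
    coverage = (len(pred_facts & gold_facts) / len(gold_facts)) if gold_facts else 1.0
    return EvalResult(answer_correct=correct, fact_coverage=coverage)
```

Tracking both values separates systems that guess the right number from systems that actually assembled the supporting evidence.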

Abstract

This paper introduces the task of analytical question answering over large, semi-structured document collections. We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi-document QA benchmarks that typically require information from only a few documents with limited cross-document reasoning, MuDABench demands extensive inter-document analysis and aggregation. Constructed via distant supervision by leveraging document-level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate-fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi-agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single-document information extraction accuracy and insufficient domain-specific knowledge in current systems. MuDABench is available at https://github.com/Zhanli-Li/MuDABench.
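
For intuition, the following is a hypothetical sketch of the planning, extraction, and code-generation pipeline the abstract describes. Here `llm` stands in for any text-in/text-out completion client, and every prompt and variable name is illustrative rather than the authors' implementation.

```python
def answer_analytical_question(question: str, docs: list[str], llm) -> str:
    """Three-stage pipeline: plan -> extract -> aggregate via generated code."""
    # Planning: decide which facts are needed and from which documents.
    plan = llm(
        "List the per-document facts needed to answer this question:\n" + question
    )

    # Extraction: collect the planned facts from each document. The paper
    # identifies this single-document step as a primary error source.
    facts = [
        llm(f"Extract the following facts if present:\n{plan}\n\nDocument:\n{doc}")
        for doc in docs
    ]

    # Code generation: aggregate the extracted facts programmatically rather
    # than asking the model to do the arithmetic in free text.
    program = llm(
        "Write Python that computes the final answer from the list FACTS and "
        f"assigns it to a variable named `answer`.\nQuestion: {question}"
    )
    scope = {"FACTS": facts}
    exec(program, scope)  # in practice, run generated code inside a sandbox
    return str(scope.get("answer", ""))
```

Executing generated code hands the aggregation step to a deterministic interpreter, which is the usual motivation for such a stage; consistent with the paper's analysis, the remaining errors would then concentrate in extraction and in missing domain knowledge.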