AI Navigate

AIDABench: AI Data Analytics Benchmark

arXiv cs.AI / March 18, 2026


Key Points

  • AIDABench introduces a comprehensive end-to-end benchmark with 600+ document analytics tasks across three capabilities: question answering, data visualization, and file generation.
  • Tasks involve realistic, heterogeneous data such as spreadsheets, databases, financial reports, and operational records, spanning diverse industries and job functions.
  • Evaluations on 11 models (proprietary and open-source) show the best pass-at-1 is 59.43%, underscoring remaining gaps in real-world AI data analytics capabilities.
  • The paper provides failure mode analyses, identifies key research challenges, and positions AIDABench as a reference for enterprise procurement and model optimization, with the benchmark publicly available on GitHub.

Abstract

As AI-driven document understanding and processing tools become increasingly prevalent in real-world applications, the need for rigorous evaluation standards has become urgent. Existing benchmarks and evaluations often focus on isolated capabilities or simplified scenarios, failing to capture the end-to-end task effectiveness required in practical settings. To address this gap, we introduce AIDABench, a comprehensive benchmark for evaluating AI systems on complex data analytics tasks in an end-to-end manner. AIDABench encompasses 600+ diverse document analysis tasks across three core capability dimensions: question answering, data visualization, and file generation. These tasks are grounded in realistic scenarios involving heterogeneous data types, including spreadsheets, databases, financial reports, and operational records, and reflect analytical demands across diverse industries and job functions. Notably, the tasks in AIDABench are sufficiently challenging that even human experts require 1-2 hours per question when assisted by AI tools, underscoring the benchmark's difficulty and real-world complexity. We evaluate 11 state-of-the-art models on AIDABench, spanning both proprietary (e.g., Claude Sonnet 4.5, Gemini 3 Pro Preview) and open-source (e.g., Qwen3-Max-2026-01-23-Thinking) families. Our results reveal that complex, real-world data analytics tasks remain a significant challenge for current AI systems, with the best-performing model achieving only 59.43% pass-at-1. We provide a detailed analysis of failure modes across each capability dimension and identify key challenges for future research. AIDABench offers a principled reference for enterprise procurement, tool selection, and model optimization, and is publicly available at https://github.com/MichaelYang-lyx/AIDABench.
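For readers unfamiliar with the pass-at-1 metric quoted above: in the conventional formulation used by most code- and task-generation benchmarks, per-task pass@k is estimated from n sampled attempts of which c succeed, and the benchmark score is the mean over tasks. The paper does not spell out its exact estimator here, so the sketch below assumes the standard unbiased formula; the task data is hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n attempts, c of which
    are correct, succeeds."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level pass@1 is the mean over tasks; for k=1 the estimator
# reduces to c/n per task. Hypothetical (attempts, successes) triples:
results = [(5, 3), (5, 0), (5, 5)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {score:.2%}")  # → pass@1 = 53.33%
```

With a single attempt per task (n = 1), the score is simply the fraction of tasks solved on the first try, which is how a headline figure like 59.43% is typically read.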