DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation

arXiv cs.AI / April 17, 2026

📰 News · Tools & Practical Usage · Models & Research

Key Points

  • DR$^{3}$-Eval is introduced as a realistic, reproducible benchmark to evaluate Deep Research Agents, especially for multimodal, multi-file report generation in complex research settings.
  • The benchmark is built from authentic user-provided materials and a static per-task research sandbox corpus that mimics open-web complexity while staying fully verifiable.
  • The evaluation framework uses multiple dimensions—Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality—and is validated against human judgments.
  • Experiments with DR$^{3}$-Agent (using multiple state-of-the-art language models) show the benchmark is highly challenging and exposes key failure modes, including retrieval robustness issues and hallucination control.
  • The authors state that the code and data are publicly available.

Abstract

Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR$^{3}$-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR$^{3}$-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR$^{3}$-Agent based on multiple state-of-the-art language models demonstrate that DR$^{3}$-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.
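The five evaluation dimensions could be aggregated roughly as sketched below. Note this is an illustrative sketch only: the `ReportScores` structure, the equal default weights, and the `overall_score` helper are all assumptions for illustration; the abstract names the dimensions but does not specify how (or whether) they are combined into a single score.

```python
from dataclasses import dataclass

@dataclass
class ReportScores:
    """Hypothetical per-report scores on the five DR^3-Eval dimensions, each in [0, 1]."""
    information_recall: float
    factual_accuracy: float
    citation_coverage: float
    instruction_following: float
    depth_quality: float

def overall_score(s: ReportScores, weights=None) -> float:
    """Combine the five dimension scores into one number via a weighted average.

    Equal weights are a placeholder assumption, not the paper's method.
    """
    dims = [
        s.information_recall,
        s.factual_accuracy,
        s.citation_coverage,
        s.instruction_following,
        s.depth_quality,
    ]
    if weights is None:
        weights = [1.0 / len(dims)] * len(dims)  # uniform weighting by default
    return sum(w * d for w, d in zip(weights, dims))

# Example with made-up scores for a hypothetical agent run:
scores = ReportScores(0.62, 0.71, 0.48, 0.85, 0.55)
print(round(overall_score(scores), 3))  # mean of the five scores
```

In practice, each dimension would itself be produced by a grader (e.g. an LLM judge checked against human annotations, as the paper validates), with the aggregation step kept separate so dimensions can also be reported individually.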