DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

arXiv cs.AI / 4/13/2026


Key Points

  • The paper introduces DRBENCHER, a synthetic benchmark designed to test deep research agents that must both browse to identify an entity and then perform multi-step computation on retrieved properties.
  • DRBENCHER generates questions with an answer-first pipeline enforcing four explicit criteria: verifiability (gold answers are computed by executing parameterized code over knowledge-graph values), complexity (multi-hop entity identification and property retrieval plus domain-specific math), difficulty (a two-stage verification cascade filters out questions the generating model can solve), and diversity (a greedy max-min embedding filter maximizes semantic coverage).
  • Across five domains (biochemistry, financial, geophysical, security, and history), human evaluation finds 76% validity (84% excluding stale data), with 35% of errors stemming from outdated knowledge-graph entries.
  • Automatic evaluation shows that even the strongest frontier model reaches only 20% answer accuracy, underscoring that current agents still struggle with end-to-end browse-and-compute tasks.
  • Compared with manually constructed benchmarks (BrowseComp+, MATH-500, GPQA), DRBENCHER achieves the highest semantic diversity, reducing the blind spots created by evaluating browsing and computation separately.
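
The verifiability criterion above means a gold answer can always be recomputed deterministically by running code over knowledge-graph values. A minimal sketch of that idea, with an entirely hypothetical entity, properties, and formula (not taken from the paper):

```python
# Hypothetical illustration of "verifiability": the gold answer is produced
# by executing a parameterized function over property values retrieved from
# a knowledge graph, so it can be recomputed and checked deterministically.
# The entity "AcmeCorp", its properties, and the ratio below are assumptions
# for illustration only.

knowledge_graph = {
    "AcmeCorp": {"total_debt": 42.0e9, "shareholder_equity": 30.0e9},
}

def debt_to_equity(entity: str, kg: dict) -> float:
    """Parameterized gold-answer code: retrieve properties, do the math."""
    props = kg[entity]
    return round(props["total_debt"] / props["shareholder_equity"], 2)

gold = debt_to_equity("AcmeCorp", knowledge_graph)
print(gold)  # 1.4
```

An agent's free-form answer can then be graded against `gold` exactly, which is what makes synthetic questions of this kind automatically verifiable.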

Abstract

Deep research agents increasingly interleave web browsing with multi-step computation, yet existing benchmarks evaluate these capabilities in isolation, creating a blind spot in assessing real-world performance. We introduce DRBENCHER, a synthetic benchmark generator for questions that require both browsing and computation. It enforces four criteria: verifiability (gold answers are computed by executing parameterized code over knowledge-graph values), complexity (multi-hop entity identification, property retrieval, and domain-specific computation), difficulty (a two-stage verification cascade filters out questions solvable by the generating model), and diversity (a greedy max-min embedding filter maximizes coverage). These criteria are realized via a unified answer-first pipeline spanning five domains: biochemistry, financial, geophysical, security, and history. Human evaluation shows 76% validity (84% excluding stale data), with 35% of errors due to outdated knowledge-graph entries, highlighting an inherent limitation of systems that reason over evolving data. Automatic evaluation shows that the strongest frontier model achieves only 20% answer accuracy. Compared to manually constructed benchmarks (BrowseComp+, MATH-500, GPQA), DRBENCHER achieves the highest semantic diversity.
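
The "greedy max-min embedding filter" named in the abstract is, plausibly, classic farthest-point selection over question embeddings: repeatedly keep the candidate whose minimum distance to the already-selected set is largest. A minimal sketch under that assumption (the seeding choice and Euclidean metric are illustrative, not from the paper):

```python
import numpy as np

def greedy_max_min(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedy max-min (farthest-point) selection of k row indices.

    At each step, picks the point whose minimum distance to the
    selected set is largest, spreading coverage across the space.
    """
    selected = [0]  # seed with the first item (arbitrary choice)
    # Minimum distance from every point to the selected set so far.
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < k:
        idx = int(np.argmax(dists))  # farthest from current selection
        selected.append(idx)
        new_d = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        dists = np.minimum(dists, new_d)
    return selected
```

For example, with 1-D embeddings at 0.0, 0.1, 5.0, and 10.0, selecting three items keeps the two extremes and the midpoint while dropping the near-duplicate at 0.1.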