From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents

arXiv cs.LG · March 27, 2026


Key Points

  • The paper argues that current deep research agent (DRA) evaluations rely on ad hoc benchmarks that fail to rigorously model agent behavior, especially for long-horizon synthesis and ambiguity handling.
  • It proposes a category-theory-based framework that represents DRA workflows as compositions of structure-preserving maps (functors), enabling more formal structural evaluation.
  • The authors introduce a mechanism-aware benchmark of 296 questions evaluated across four interpretable stress-testing axes (sequential connectivity, V-structure intersection verification, topological ordering, and ontological falsification via the Yoneda Probe).
  • Testing 11 leading models shows a persistently low performance baseline, with state-of-the-art reaching only 19.9% average accuracy and a capability split: agents do well on certain structural verification tasks but largely fail on multi-hop structural synthesis.
  • Large performance variance across tasks suggests existing systems still depend heavily on brittle heuristics rather than a systemic understanding of complex structural information.
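The categorical framing in the second bullet can be sketched in standard notation. The stage names and category labels below are illustrative assumptions, not notation taken from the paper; they only show what "a workflow as a composition of structure-preserving maps" means formally.

```latex
% A deep research workflow as a composition of functors (illustrative):
%   C_intent --R--> C_docs --S--> C_claims --V--> C_evidence
% where R = retrieval, S = synthesis, V = verification.
\[
  W \;=\; V \circ S \circ R \;:\; \mathcal{C}_{\mathrm{intent}} \longrightarrow \mathcal{C}_{\mathrm{evidence}}.
\]
% Functoriality (structure preservation) requires W to respect
% composition and identities:
\[
  W(g \circ f) \;=\; W(g) \circ W(f), \qquad W(\mathrm{id}_X) \;=\; \mathrm{id}_{W(X)}.
\]
```

Under this reading, "structural evaluation" amounts to testing whether an agent's end-to-end behavior actually preserves the compositional structure of the task, rather than merely scoring its final answers.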

Abstract

Although deep research agents (DRAs) have emerged as a promising paradigm for complex information synthesis, their evaluation remains constrained by ad hoc empirical benchmarks. These heuristic approaches neither rigorously model agent behavior nor adequately stress-test long-horizon synthesis and ambiguity resolution. To bridge this gap, we formalize DRA behavior through the lens of category theory, modeling the deep research workflow as a composition of structure-preserving maps (functors). Grounded in this theoretical framework, we introduce a novel mechanism-aware benchmark with 296 questions designed to stress-test agents along four interpretable axes: traversing sequential connectivity chains, verifying intersections within V-structure pullbacks, imposing topological ordering on retrieved substructures, and performing ontological falsification via the Yoneda Probe. Our rigorous evaluation of 11 leading models establishes a persistently low baseline, with the state of the art achieving only 19.9% average accuracy, exposing the difficulty of formal structural stress-testing. Furthermore, our findings reveal a stark dichotomy in current AI capabilities. While advanced deep research pipelines successfully handle dynamic topological re-ordering and exhibit robust ontological verification, matching pure reasoning models in falsifying hallucinated premises, they almost universally collapse on multi-hop structural synthesis. Crucially, massive performance variance across tasks exposes a lingering reliance on brittle heuristics rather than a systemic understanding. Ultimately, this work demonstrates that while top-tier autonomous agents can now organically unify search and reasoning, achieving generalized mastery over complex structural information remains a formidable open challenge. Our implementation will be available at https://github.com/tzq1999/CDR.
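As a hedged illustration of the "V-structure pullback" axis named in the abstract (the notation here is assumed, not drawn from the paper): a V-structure A → C ← B is a cospan, and verifying an intersection corresponds to checking the universal property of its pullback, which in the category of sets reduces to a concrete agreement condition.

```latex
% Cospan (V-structure):  f : A -> C  and  g : B -> C.
% In Set, the pullback collects exactly the pairs that agree in C:
\[
  A \times_C B \;=\; \{\, (a, b) \in A \times B \mid f(a) = g(b) \,\},
\]
% with projections p : A x_C B -> A and q : A x_C B -> B
% making the square commute:  f \circ p = g \circ q.
```

Intuitively, an agent passes this axis only if the evidence it retrieves along both arms of the V genuinely agrees at the shared node, rather than being stitched together heuristically.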