From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents
arXiv cs.LG / 3/27/2026
Key Points
- The paper argues that current deep research agent (DRA) evaluations rely on ad hoc benchmarks that fail to rigorously model agent behavior, especially for long-horizon synthesis and ambiguity handling.
- It proposes a category-theory-based framework that represents DRA workflows as compositions of structure-preserving maps (functors), enabling more formal structural evaluation.
- The authors introduce a mechanism-aware benchmark of 296 questions evaluated across four interpretable stress-testing axes (sequential connectivity, V-structure intersection verification, topological ordering, and ontological falsification via the Yoneda Probe).
- Testing 11 leading models shows a persistently low performance baseline, with state-of-the-art reaching only 19.9% average accuracy and a capability split: agents do well on certain structural verification tasks but largely fail on multi-hop structural synthesis.
- Large performance variance across tasks suggests that existing systems still rely on brittle heuristics rather than a genuine, systematic understanding of complex structural information.
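The second point — modeling a DRA workflow as a composition of structure-preserving maps — can be made concrete with a toy sketch. The stages (`retrieve`, `synthesize`, `report`) and the provenance invariant below are hypothetical illustrations, not the paper's actual formalism; the point is only that each stage maps one representation to the next while preserving a chosen structural property, so the composed pipeline preserves it too.

```python
# Toy sketch (assumed, not from the paper): a research-agent workflow as a
# left-to-right composition of maps, each preserving document provenance.

from functools import reduce

def retrieve(query):
    # Hypothetical stage: query -> list of (doc_id, snippet) pairs
    return [("doc1", f"evidence for {query}"), ("doc2", f"context for {query}")]

def synthesize(snippets):
    # Stage: snippets -> claims, each tagged with the doc_ids it came from
    return [{"claim": text, "sources": [doc_id]} for doc_id, text in snippets]

def report(claims):
    # Stage: claims -> final report lines; provenance is carried through
    return [f"{c['claim']} [{','.join(c['sources'])}]" for c in claims]

def compose(*stages):
    # Compose stages left to right: compose(f, g, h)(x) == h(g(f(x)))
    return lambda x: reduce(lambda acc, f: f(acc), stages, x)

pipeline = compose(retrieve, synthesize, report)
output = pipeline("agent evaluation")

# Structural check: the invariant (a provenance tag) survives composition.
assert all("[" in line and "]" in line for line in output)
```

A structural evaluation in this spirit would check such invariants on the composed pipeline rather than scoring only the final answer.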