AI Navigate

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

arXiv cs.CL / 3/13/2026


Key Points

  • The MADQA benchmark introduces 2,250 human-authored questions grounded in 800 heterogeneous PDF documents to study whether multimodal agents exhibit strategic reasoning or rely on brute-force search.
  • The design uses Classical Test Theory to maximize discriminative power across varying agentic abilities and implements a new evaluation protocol that measures the accuracy-effort trade-off.
  • The study finds that top agents can match human searchers in raw accuracy but succeed on largely different questions, rely on brute-force search to compensate for weak planning, and fail to close a roughly 20% gap to oracle performance, instead persisting in unproductive loops.
  • The authors release MADQA and its evaluation harness to promote a shift from brute-force retrieval toward calibrated, efficient reasoning in document-intensive workflows.
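The Classical Test Theory design mentioned above typically selects questions by their discrimination: how strongly success on an item correlates with overall ability. The paper's exact selection criterion is not given here, but a standard CTT measure is the point-biserial correlation between an item's 0/1 scores and total scores; the sketch below (function names are illustrative, not from MADQA) shows how such an index could be computed.

```python
import statistics

def discrimination_index(item_scores, total_scores):
    """Point-biserial correlation between one item's 0/1 scores and each
    examinee's total score -- a standard CTT discrimination measure.
    Items near 0 discriminate poorly; items near 1 separate strong
    from weak performers and would be kept in a discriminative benchmark."""
    n = len(item_scores)
    p = sum(item_scores) / n                   # proportion answering correctly
    sd_total = statistics.pstdev(total_scores)
    if sd_total == 0 or p in (0.0, 1.0):
        return 0.0                             # item carries no signal
    mean_total = statistics.mean(total_scores)
    mean_correct = statistics.mean(
        t for i, t in zip(item_scores, total_scores) if i == 1
    )
    return (mean_correct - mean_total) / sd_total * (p / (1 - p)) ** 0.5

# Toy example: 4 examinees, one item, totals over the whole test.
d = discrimination_index([1, 1, 0, 0], [3, 4, 1, 2])
```

A benchmark maximizing discriminative power would favor items with a high index across a pool of agents and humans of varying ability.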

Abstract

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet a critical question remains: do these agents demonstrate genuine strategic reasoning, or do they merely perform stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic ability. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
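The accuracy-effort trade-off protocol is not specified in detail here, but a natural way to operationalize it is to record, per question, whether the agent answered correctly and how many actions (tool calls, page retrievals) it spent, then plot accuracy as a function of an action budget. The sketch below is a minimal illustration under that assumption; the data structure and function names are hypothetical, not from the MADQA harness.

```python
def accuracy_at_budget(runs, budget):
    """Fraction of questions answered correctly within `budget` actions.
    `runs` is a list of (correct: bool, num_actions: int), one per question.
    A brute-force searcher needs a large budget to reach its peak accuracy;
    a strategic one reaches it early."""
    return sum(1 for ok, n in runs if ok and n <= budget) / len(runs)

def accuracy_effort_curve(runs, budgets):
    """Accuracy at each action budget -- the curve whose shape separates
    efficient reasoning from trial-and-error search."""
    return [(b, accuracy_at_budget(runs, b)) for b in budgets]

# Toy example: three questions, varying effort.
runs = [(True, 2), (True, 10), (False, 3)]
curve = accuracy_effort_curve(runs, [5, 10])
```

Two agents with identical final accuracy can then be distinguished by how quickly their curves saturate, which is the behavioral contrast the benchmark is built to expose.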