Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
arXiv cs.CL / 3/13/2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- The MADQA benchmark introduces 2,250 human-authored questions grounded in 800 heterogeneous PDF documents to study whether multimodal agents exhibit strategic reasoning or rely on brute-force search.
- The benchmark was designed with Classical Test Theory to maximize discriminative power across agents of varying ability, and it introduces an evaluation protocol that measures the accuracy-effort trade-off.
- The study finds that top agents can match human searchers in raw accuracy but succeed on largely different questions, compensate for weak planning with brute-force search, and leave a roughly 20-point gap to oracle performance, largely due to unproductive search loops.
- The authors release MADQA and its evaluation harness to promote a shift from brute-force retrieval toward calibrated, efficient reasoning in document-intensive workflows.
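This summary does not give the paper's exact item-selection formulas, but the standard Classical Test Theory discrimination index it invokes is the point-biserial correlation between a single item's score and each test-taker's total score. A minimal sketch (the function name and toy data below are illustrative, not from the paper):

```python
import math
import statistics

def point_biserial(item_scores, total_scores):
    """CTT discrimination index: point-biserial correlation between a
    dichotomous item (0/1 per test-taker) and the total test score.
    High values mean the item separates strong from weak test-takers."""
    p = sum(item_scores) / len(item_scores)  # proportion answering correctly
    if p in (0.0, 1.0):
        return 0.0  # everyone (or no one) passed: item cannot discriminate
    pairs = list(zip(item_scores, total_scores))
    mean_pass = statistics.mean(t for i, t in pairs if i == 1)
    mean_fail = statistics.mean(t for i, t in pairs if i == 0)
    sd = statistics.pstdev(total_scores)
    return (mean_pass - mean_fail) / sd * math.sqrt(p * (1 - p))

# Toy example: the item is answered correctly by the two strongest takers.
r = point_biserial([1, 1, 0, 0], [3, 2, 1, 0])  # → ≈0.894
```

In a benchmark built this way, questions with low discrimination (near-zero point-biserial) would be filtered out so the remaining items spread agents of different ability apart.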