Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases
arXiv cs.AI / 3/25/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper evaluates whether LLM agents can generate real-world evidence end-to-end: rather than answering isolated QA steps, agents must reproduce observational studies in medical databases, combining database execution with coherent reporting.
- It introduces RWE-bench, a benchmark built from MIMIC-IV and peer-reviewed observational studies: agents receive the study protocol as the reference standard and must produce tree-structured evidence bundles (see the first sketch after this list).
- Across 162 tasks with six LLMs and three different agent scaffolds, overall task success is low, with the best agent at 39.9% and the best open-source model at 30.4%.
- The choice of agent scaffold significantly affects outcomes, driving over 30% variation in performance; workflow design is therefore a key determinant of results.
- The authors also introduce an automated cohort evaluation method that pinpoints where in the workflow errors occur and characterizes agent failure modes (see the second sketch after this list); they conclude that efficient validation is a major open direction.
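
The paper's exact bundle schema is not given here, so the following is a minimal, hypothetical sketch of what a tree-structured evidence bundle mirroring a study protocol could look like; all node names and fields are assumptions, not the benchmark's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceNode:
    """Hypothetical node in a tree-structured evidence bundle."""
    name: str                                 # e.g. "cohort_definition"
    content: str = ""                         # narrative, SQL, or numbers the agent produced
    children: list["EvidenceNode"] = field(default_factory=list)

    def walk(self, depth: int = 0):
        """Yield (depth, node) pairs in pre-order, e.g. for reporting or validation."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)

# A bundle whose structure mirrors an observational study protocol (invented example):
bundle = EvidenceNode("study", "Effect of exposure X on outcome Y", [
    EvidenceNode("cohort_definition", "SELECT subject_id FROM admissions WHERE ..."),
    EvidenceNode("statistical_analysis", "Cox proportional hazards; HR with 95% CI"),
    EvidenceNode("report", "Write-up tying cohort, analysis, and result together"),
])

for depth, node in bundle.walk():
    print("  " * depth + node.name)
```

A tree like this lets an evaluator check each protocol step separately instead of grading only the final report.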
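Similarly hypothetical, an automated cohort evaluation can localize errors by comparing the subject set an agent selects at each protocol step against the reference cohort: the first diverging step flags where the agent went wrong. The step names and comparison logic below are illustrative, not the paper's method.

```python
def locate_cohort_error(agent_steps: dict[str, set[int]],
                        reference_steps: dict[str, set[int]]) -> str | None:
    """Return the first protocol step where the agent's cohort diverges from the reference."""
    for step, ref_ids in reference_steps.items():
        agent_ids = agent_steps.get(step, set())
        if agent_ids != ref_ids:
            missing = len(ref_ids - agent_ids)   # subjects the agent wrongly excluded
            extra = len(agent_ids - ref_ids)     # subjects the agent wrongly included
            print(f"{step}: {missing} missing, {extra} extra subjects")
            return step
    return None  # cohorts match at every step

# Invented toy data: the agent's exclusion step drops one subject too many.
reference = {"inclusion": {1, 2, 3, 4}, "exclusion": {1, 2, 3}}
agent     = {"inclusion": {1, 2, 3, 4}, "exclusion": {1, 2}}
print(locate_cohort_error(agent, reference))  # -> "exclusion"
```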