DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI

arXiv cs.AI / 4/20/2026

Key Points

  • DeepER-Med is an agentic AI framework for medicine that targets trustworthy and transparent evidence-based research by making the workflow explicit and inspectable.
  • The system organizes deep medical research into three modules—research planning, agentic collaboration, and evidence synthesis—to reduce compounding errors from weak or uncheckable evidence appraisal.
  • The work introduces DeepER-MedQA, a benchmark dataset of 100 expert-level, evidence-grounded medical research questions based on real research scenarios.
  • Reported evaluations indicate that DeepER-Med outperforms widely used production-grade platforms in expert manual scoring across multiple criteria, and that its conclusions align with clinical recommendations in seven of the eight real-world clinical cases reviewed by clinicians.
  • The authors argue that current benchmarks rarely test complex, real-world medical questions, and they present DeepER-MedQA as a more realistic basis for evaluation.

Abstract

Trustworthiness and transparency are essential for the clinical adoption of artificial intelligence (AI) in healthcare and biomedical research. Recent deep research systems aim to accelerate evidence-grounded scientific discovery by integrating AI agents with multi-hop information retrieval, reasoning, and synthesis. However, most existing systems lack explicit and inspectable criteria for evidence appraisal, creating a risk of compounding errors and making it difficult for researchers and clinicians to assess the reliability of their outputs. In parallel, current benchmarking approaches rarely evaluate performance on complex, real-world medical questions. Here, we introduce DeepER-Med, a Deep Evidence-based Research framework for Medicine with an agentic AI system. DeepER-Med frames deep medical research as an explicit and inspectable workflow of evidence-based generation, consisting of three modules: research planning, agentic collaboration, and evidence synthesis. To support realistic evaluation, we also present DeepER-MedQA, an evidence-grounded dataset comprising 100 expert-level research questions derived from authentic medical research scenarios and curated by a multidisciplinary panel of 11 biomedical experts. Expert manual evaluation demonstrates that DeepER-Med consistently outperforms widely used production-grade platforms across multiple criteria, including the generation of novel scientific insights. We further demonstrate the practical utility of DeepER-Med through eight real-world clinical cases. Human clinician assessment indicates that DeepER-Med's conclusions align with clinical recommendations in seven cases, highlighting its potential for medical research and decision support.
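
To make the idea of an "explicit and inspectable workflow" concrete, here is a minimal Python sketch of a three-stage pipeline that records every appraisal decision in a trace a reviewer can read afterwards. It is an illustration only, not the authors' implementation: the names `Evidence`, `Trace`, `STUDY_RANK`, `plan_research`, `appraise`, `synthesize`, and `run` are all hypothetical, the study-design ranking is an assumed stand-in for whatever appraisal criteria DeepER-Med actually uses, and the agentic collaboration module is reduced to a single appraisal function over a hand-supplied list of evidence.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source: str      # hypothetical citation label
    claim: str       # what the source is taken to support
    study_type: str  # e.g. "rct", "cohort", "case_report"


@dataclass
class Trace:
    """Stores intermediate decisions so they can be inspected later."""
    plan: list = field(default_factory=list)
    appraisals: list = field(default_factory=list)


# Assumed appraisal rule for illustration: rank by study design only.
STUDY_RANK = {"rct": 3, "cohort": 2, "case_report": 1}


def plan_research(question, trace):
    """Stage 1 (research planning): write down the steps before acting."""
    trace.plan = [
        f"retrieve studies relevant to: {question}",
        "appraise each study with the explicit rule",
        "synthesize only evidence that passes appraisal",
    ]
    return trace.plan


def appraise(items, trace):
    """Stage 2 (stand-in for agentic collaboration): score each item and
    log the score together with the reason it was assigned."""
    for item in items:
        rank = STUDY_RANK.get(item.study_type, 0)  # unknown designs score 0
        trace.appraisals.append(
            (item.source, rank, f"study_type={item.study_type}")
        )
    return trace.appraisals


def synthesize(items, trace, min_rank=2):
    """Stage 3 (evidence synthesis): keep claims whose source passed."""
    passed = {src for src, rank, _ in trace.appraisals if rank >= min_rank}
    return [e.claim for e in items if e.source in passed]


def run(question, items):
    """Run the three stages in order and return the claims plus the trace."""
    trace = Trace()
    plan_research(question, trace)
    appraise(items, trace)
    return synthesize(items, trace), trace


if __name__ == "__main__":
    demo = [
        Evidence("study-A", "drug X lowers event rate", "rct"),
        Evidence("study-B", "drug X caused a rash in one patient", "case_report"),
    ]
    claims, trace = run("Does drug X help?", demo)
    print(claims)
    for row in trace.appraisals:
        print(row)
```

The point of the sketch is the `Trace` object: because each appraisal is stored with its score and reason, a reader can see why a piece of evidence was kept or dropped instead of having to trust the final synthesis.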