Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve

arXiv cs.AI / 4/17/2026


Key Points

  • Evo-MedAgent is a new tool-augmented LLM medical agent framework for chest X-ray interpretation that addresses the limitation of solving each case in isolation.
  • It introduces a self-evolving test-time memory module with three components: retrospective clinical episode retrieval, adaptive procedural heuristics that improve via reflection, and a tool reliability controller that tracks per-tool trustworthiness.
  • Experiments on ChestAgentBench show substantial accuracy gains, improving MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and from 0.76 to 0.87 on Gemini-3 Flash.
  • The method does not require additional training, and keeps per-case overhead limited to one extra retrieval pass plus a single reflection call, enabling deployment on top of frozen base models.
  • Overall, the work claims that on qualitative diagnostic tasks, evolving inter-case memory improves performance more effectively than relying solely on orchestrating external tools.

Abstract

Tool-augmented large language model (LLM) agents can orchestrate specialist classifiers, segmentation models, and visual question-answering modules to interpret chest X-rays. However, these agents still solve each case in isolation: they fail to accumulate experience across cases, correct recurrent reasoning mistakes, or adapt their tool-use behavior without expensive reinforcement learning. While a radiologist naturally improves with every case, current agents remain static. In this work, we propose Evo-MedAgent, a self-evolving memory module that equips a medical agent with the capacity for inter-case learning at test time. Our memory comprises three complementary stores: (1) Retrospective Clinical Episodes that retrieve problem-solving experiences from similar past cases, (2) an Adaptive Procedural Heuristics bank curating priority-tagged diagnostic rules that evolves via reflection, much like a physician refining their internal criteria, and (3) a Tool Reliability Controller that tracks per-tool trustworthiness. On ChestAgentBench, Evo-MedAgent raises multiple-choice question (MCQ) accuracy from 0.68 to 0.79 on GPT-5-mini, and from 0.76 to 0.87 on Gemini-3 Flash. With a strong base model, evolving memory improves performance more effectively than orchestrating external tools on qualitative diagnostic tasks. Because Evo-MedAgent requires no training, its per-case overhead is bounded by one additional retrieval pass and a single reflection call, making it deployable on top of any frozen model.
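To make the three-store design concrete, here is a minimal sketch of how such a memory module could be structured. All class and method names are hypothetical, and keyword overlap stands in for whatever episode-similarity measure the real system uses; this illustrates the idea, not the authors' implementation.

```python
# Hypothetical sketch of Evo-MedAgent's three memory stores.
# Names, data structures, and the similarity measure are illustrative only.
from collections import defaultdict

class EvoMemory:
    def __init__(self):
        self.episodes = []                             # (keywords, case summary)
        self.heuristics = {}                           # rule text -> priority score
        self.tool_stats = defaultdict(lambda: [0, 0])  # tool -> [successes, uses]

    # (1) Retrospective Clinical Episodes: retrieve experiences from similar
    # past cases. Keyword overlap is a stand-in for embedding similarity.
    def record_episode(self, keywords, summary):
        self.episodes.append((set(keywords), summary))

    def retrieve(self, keywords, k=1):
        query = set(keywords)
        ranked = sorted(self.episodes,
                        key=lambda e: len(e[0] & query), reverse=True)
        return [summary for _, summary in ranked[:k]]

    # (2) Adaptive Procedural Heuristics: a reflection call after each case
    # promotes rules that helped and demotes rules that misled.
    def reflect(self, rule, helped):
        self.heuristics[rule] = self.heuristics.get(rule, 0) + (1 if helped else -1)

    def top_heuristics(self, k=3):
        return sorted(self.heuristics, key=self.heuristics.get, reverse=True)[:k]

    # (3) Tool Reliability Controller: track per-tool trustworthiness so the
    # agent can weight or skip tools with poor track records.
    def log_tool(self, tool, success):
        stats = self.tool_stats[tool]
        stats[0] += int(success)
        stats[1] += 1

    def reliability(self, tool):
        successes, uses = self.tool_stats[tool]
        return successes / uses if uses else 1.0  # optimistic prior for new tools
```

Note how this matches the paper's cost claim: each new case triggers one `retrieve` pass and one `reflect` call, with no gradient updates to the frozen base model.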
