CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation

arXiv cs.AI / 3/23/2026

Key Points

  • The paper introduces the CURE benchmark for multimodal clinical AI, designed to disentangle reasoning from evidence retrieval by using 500 clinical cases linked to physician-cited literature.
  • It evaluates state-of-the-art multimodal LLMs across closed-ended and open-ended diagnostic tasks under different evidence-gathering paradigms.
  • Findings show a stark gap: models achieve up to 73.4% accuracy on differential diagnosis when given physician-cited reference evidence, but drop to as low as 25.4% when relying on their own independent retrieval.
  • CURE's availability on GitHub enables broader benchmarking and highlights the need to improve retrieval and evidence-grounded multimodal reasoning in clinical AI.

Abstract

Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing complex visual and textual data alongside consulting authoritative medical literature. However, existing benchmarks primarily evaluate MLLMs in end-to-end answering scenarios. This limits the ability to disentangle a model's foundational multimodal reasoning from its proficiency in evidence retrieval and application. We introduce the Clinical Understanding and Retrieval Evaluation (CURE) benchmark. Comprising 500 multimodal clinical cases mapped to physician-cited reference literature, CURE evaluates reasoning and retrieval under controlled evidence settings to disentangle their respective contributions. We evaluate state-of-the-art MLLMs across distinct evidence-gathering paradigms in both closed-ended and open-ended diagnosis tasks. Evaluations reveal a stark dichotomy: while advanced models demonstrate clinical reasoning proficiency when supplied with physician reference evidence (achieving up to 73.4% accuracy on differential diagnosis), their performance substantially declines (as low as 25.4%) when reliant on independent retrieval mechanisms. This disparity highlights the dual challenges of effectively integrating multimodal clinical evidence and retrieving precise supporting literature. CURE is publicly available at https://github.com/yanniangu/CURE.
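The core of CURE's design is running the same model on the same cases under different evidence conditions and comparing accuracy. A minimal sketch of that controlled comparison is below; all names (`ClinicalCase`, `evaluate`, the stub model and retriever) are illustrative assumptions, not from the paper's released code.

```python
# Hypothetical sketch of CURE-style controlled evidence evaluation:
# score one model on identical cases under (a) physician-cited reference
# evidence and (b) evidence fetched by an independent retriever.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ClinicalCase:
    case_id: str
    question: str            # closed-ended diagnosis question
    reference_evidence: str  # physician-cited literature excerpt
    answer: str              # gold diagnosis label

def evaluate(
    cases: List[ClinicalCase],
    model: Callable[[str, str], str],     # (question, evidence) -> diagnosis
    retrieve: Callable[[str], str],       # question -> retrieved evidence
) -> Dict[str, float]:
    """Return accuracy under both evidence settings for the same model."""
    correct = {"reference": 0, "retrieved": 0}
    for case in cases:
        if model(case.question, case.reference_evidence) == case.answer:
            correct["reference"] += 1
        if model(case.question, retrieve(case.question)) == case.answer:
            correct["retrieved"] += 1
    n = len(cases)
    return {setting: hits / n for setting, hits in correct.items()}

# Toy demo: a stub model that answers correctly only when it sees the
# gold evidence, and a retriever that finds it for just one of two cases.
cases = [ClinicalCase("c1", "q1", "evidence-1", "dx1"),
         ClinicalCase("c2", "q2", "evidence-2", "dx2")]

def stub_model(question: str, evidence: str) -> str:
    return {"evidence-1": "dx1", "evidence-2": "dx2"}.get(evidence, "unknown")

def stub_retrieve(question: str) -> str:
    return "evidence-1" if question == "q1" else "noise"

print(evaluate(cases, stub_model, stub_retrieve))
# -> {'reference': 1.0, 'retrieved': 0.5}
```

Holding the model fixed and varying only the evidence source is what lets the reference-vs-retrieved gap (73.4% vs. 25.4% in the paper) be attributed to retrieval rather than reasoning.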