CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation

arXiv cs.AI / 3/23/2026

Key Points

  • The paper introduces the CURE benchmark for multimodal clinical AI, designed to disentangle reasoning from evidence retrieval by using 500 clinical cases linked to physician-cited literature.
  • It evaluates state-of-the-art multimodal LLMs across closed-ended and open-ended diagnostic tasks under different evidence-gathering paradigms.
  • Findings show a stark gap: models achieve up to 73.4% accuracy on differential diagnosis when given physician-cited reference evidence, but drop to as low as 25.4% when relying on their own independent retrieval.
  • CURE's availability on GitHub enables broader benchmarking and highlights the need to improve retrieval and evidence-grounded multimodal reasoning in clinical AI.

Abstract

Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing complex visual and textual data alongside consulting authoritative medical literature. However, existing benchmarks primarily evaluate MLLMs in end-to-end answering scenarios. This limits the ability to disentangle a model's foundational multimodal reasoning from its proficiency in evidence retrieval and application. We introduce the Clinical Understanding and Retrieval Evaluation (CURE) benchmark. Comprising 500 multimodal clinical cases mapped to physician-cited reference literature, CURE evaluates reasoning and retrieval under controlled evidence settings to disentangle their respective contributions. We evaluate state-of-the-art MLLMs across distinct evidence-gathering paradigms in both closed-ended and open-ended diagnosis tasks. Evaluations reveal a stark dichotomy: while advanced models demonstrate clinical reasoning proficiency when supplied with physician reference evidence (achieving up to 73.4% accuracy on differential diagnosis), their performance substantially declines (as low as 25.4%) when reliant on independent retrieval mechanisms. This disparity highlights the dual challenges of effectively integrating multimodal clinical evidence and retrieving precise supporting literature. CURE is publicly available at https://github.com/yanniangu/CURE.
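The core of CURE's design is running the same model on the same cases under different evidence conditions and comparing accuracy. A minimal sketch of that controlled comparison is below; all names (`ClinicalCase`, `evaluate`, the stub model and retriever) are illustrative assumptions, not from the paper's released code.

```python
# Hypothetical sketch of CURE-style controlled evidence evaluation:
# score one model on identical cases under (a) physician-cited reference
# evidence and (b) evidence fetched by an independent retriever.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ClinicalCase:
    case_id: str
    question: str            # closed-ended diagnosis question
    reference_evidence: str  # physician-cited literature excerpt
    answer: str              # gold diagnosis label

def evaluate(
    cases: List[ClinicalCase],
    model: Callable[[str, str], str],     # (question, evidence) -> diagnosis
    retrieve: Callable[[str], str],       # question -> retrieved evidence
) -> Dict[str, float]:
    """Return accuracy under both evidence settings for the same model."""
    correct = {"reference": 0, "retrieved": 0}
    for case in cases:
        if model(case.question, case.reference_evidence) == case.answer:
            correct["reference"] += 1
        if model(case.question, retrieve(case.question)) == case.answer:
            correct["retrieved"] += 1
    n = len(cases)
    return {setting: hits / n for setting, hits in correct.items()}

# Toy demo: a stub model that answers correctly only when it sees the
# gold evidence, and a retriever that finds it for just one of two cases.
cases = [ClinicalCase("c1", "q1", "evidence-1", "dx1"),
         ClinicalCase("c2", "q2", "evidence-2", "dx2")]

def stub_model(question: str, evidence: str) -> str:
    return {"evidence-1": "dx1", "evidence-2": "dx2"}.get(evidence, "unknown")

def stub_retrieve(question: str) -> str:
    return "evidence-1" if question == "q1" else "noise"

print(evaluate(cases, stub_model, stub_retrieve))
# -> {'reference': 1.0, 'retrieved': 0.5}
```

Holding the model fixed and varying only the evidence source is what lets the reference-vs-retrieved gap (73.4% vs. 25.4% in the paper) be attributed to retrieval rather than reasoning.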