Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation

arXiv cs.AI / 5/7/2026


Key Points

  • The study benchmarks five multimodal LLMs (four open-weight and one commercial) on dermatology-specific tasks and finds that their public benchmark results do not translate well to real-world clinical decision-making.
  • Differential diagnosis performance drops sharply in a retrospective multi-site hospital-based cohort of 5,811 cases and 46,405 clinical images: using images alone, top-3 diagnostic accuracy reaches only 1.50%–13.35% for the open-weight models and 24.65% for GPT-4.1.
  • Adding clinical context improves accuracy across all models, raising top-3 diagnostic accuracy to as high as 28.75% for open-weight models and 38.93% for GPT-4.1, yet outputs remain highly sensitive to incomplete or incorrect context.
  • For severity-based triage, models show moderate sensitivity (above 60%), indicating possible value for preliminary screening but insufficient reliability for clinical deployment.
  • Overall, the findings suggest that current dermatology multimodal LLMs are not ready for bedside use, and benchmark metrics substantially overestimate real-world capability.

Abstract

Multimodal large language models (MLLMs) have demonstrated promise on publicly available dermatology benchmarks. However, benchmark performance may not generalize to real-world dermatologic decision-making. To quantify this benchmark-to-bedside gap, we evaluated four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, and MedGemma-4B-Instruct) and one commercial MLLM (GPT-4.1) across three publicly available dermatology datasets and a retrospective multi-site hospital-based dermatology consultation cohort comprising 5,811 cases and 46,405 clinical images. Models were evaluated on two clinically relevant tasks: differential diagnosis generation and severity-based triage. Diagnostic performance was modest on public datasets and declined substantially in the real-world cohort. On public benchmarks, top-3 diagnostic accuracy reached 26.55% for the best open-weight model and 42.25% for GPT-4.1. On real-world consultation cases using images alone, top-3 diagnostic accuracy fell to 1.50%–13.35% among open-weight models and 24.65% for GPT-4.1. Incorporating clinical context improved performance across all models, raising top-3 diagnostic accuracy to as high as 28.75% among open-weight models and 38.93% for GPT-4.1. However, model outputs were highly sensitive to incomplete or erroneous consultation context. For severity-based triage, models achieved moderate sensitivity (above 60%), suggesting potential utility for screening but insufficient reliability for clinical deployment. These findings demonstrate that benchmark performance substantially overestimates the real-world clinical capability of current dermatology MLLMs.
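
For context on the two headline metrics, here is a minimal sketch of how top-k diagnostic accuracy and triage sensitivity are typically computed. The paper's exact scoring protocol (in particular, how free-text differentials are matched to reference diagnoses) is not specified in this summary, so the exact case-insensitive string matching and all names and values below are illustrative assumptions, not the authors' method.

```python
from typing import List


def top_k_accuracy(predictions: List[List[str]], labels: List[str], k: int = 3) -> float:
    """Fraction of cases whose reference diagnosis appears among the
    model's top-k differential diagnoses (exact, case-insensitive match)."""
    hits = sum(
        label.lower() in (p.lower() for p in preds[:k])
        for preds, label in zip(predictions, labels)
    )
    return hits / len(labels)


def triage_sensitivity(pred_urgent: List[bool], true_urgent: List[bool]) -> float:
    """True-positive rate for severity-based triage: of the truly urgent
    cases, the fraction the model also flags as urgent."""
    true_positives = sum(p and t for p, t in zip(pred_urgent, true_urgent))
    positives = sum(true_urgent)
    return true_positives / positives if positives else 0.0


# Toy example with made-up values (not data from the paper):
preds = [
    ["psoriasis", "eczema", "tinea corporis"],
    ["melanoma", "dysplastic nevus", "seborrheic keratosis"],
]
labels = ["eczema", "basal cell carcinoma"]
print(top_k_accuracy(preds, labels, k=3))  # 0.5 -- only the first case hits in the top 3

print(triage_sensitivity([True, False, True], [True, True, True]))  # ~0.67
```

In practice, matching a model's free-text differential against a reference diagnosis usually requires synonym normalization or expert adjudication rather than exact string comparison, which is one reason benchmark scores and real-world consultation scores can diverge.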