Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation

arXiv cs.AI / 5/7/2026


Key Points

  • The study benchmarks five multimodal LLMs (four open-weight and one commercial) on dermatology-specific tasks and finds that their public benchmark results do not translate well to real-world clinical decision-making.
  • Differential diagnosis performance drops sharply in a retrospective multi-site hospital-based cohort of 5,811 cases and 46,405 clinical images: using images alone, top-3 diagnostic accuracy reaches only 1.50%–13.35% for the open-weight models and 24.65% for GPT-4.1.
  • Adding clinical context improves accuracy across all models, raising top-3 diagnostic accuracy to as high as 28.75% for open-weight models and 38.93% for GPT-4.1, yet outputs remain highly sensitive to incomplete or incorrect context.
  • For severity-based triage, models show moderate sensitivity (above 60%), indicating possible value for preliminary screening but insufficient reliability for clinical deployment.
  • Overall, the findings suggest that current dermatology multimodal LLMs are not ready for bedside use, and benchmark metrics substantially overestimate real-world capability.

Abstract

Multimodal large language models (MLLMs) have demonstrated promise on publicly available dermatology benchmarks. However, benchmark performance may not generalize to real-world dermatologic decision-making. To quantify this benchmark-to-bedside gap, we evaluated four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, and MedGemma-4B-Instruct) and one commercial MLLM (GPT-4.1) across three publicly available dermatology datasets and a retrospective multi-site hospital-based dermatology consultation cohort comprising 5,811 cases and 46,405 clinical images. Models were evaluated on two clinically relevant tasks: differential diagnosis generation and severity-based triage. Diagnostic performance was modest on public datasets and declined substantially in the real-world cohort. On public benchmarks, top-3 diagnostic accuracy reached 26.55% for the best open-weight model and 42.25% for GPT-4.1. On real-world consultation cases using images alone, top-3 diagnostic accuracy fell to 1.50%–13.35% among open-weight models and 24.65% for GPT-4.1. Incorporating clinical context improved performance across all models, raising top-3 diagnostic accuracy to as high as 28.75% among open-weight models and 38.93% for GPT-4.1. However, model outputs were highly sensitive to incomplete or erroneous consultation context. For severity-based triage, models achieved moderate sensitivity (above 60%), suggesting potential utility for screening but insufficient reliability for clinical deployment. These findings demonstrate that benchmark performance substantially overestimates the real-world clinical capability of current dermatology MLLMs.
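
For context on the two headline metrics, here is a minimal sketch of how top-k diagnostic accuracy and triage sensitivity are typically computed. The paper's exact scoring protocol (in particular, how free-text differentials are matched to reference diagnoses) is not specified in this summary, so the exact case-insensitive string matching and all names and values below are illustrative assumptions, not the authors' method.

```python
from typing import List


def top_k_accuracy(predictions: List[List[str]], labels: List[str], k: int = 3) -> float:
    """Fraction of cases whose reference diagnosis appears among the
    model's top-k differential diagnoses (exact, case-insensitive match)."""
    hits = sum(
        label.lower() in (p.lower() for p in preds[:k])
        for preds, label in zip(predictions, labels)
    )
    return hits / len(labels)


def triage_sensitivity(pred_urgent: List[bool], true_urgent: List[bool]) -> float:
    """True-positive rate for severity-based triage: of the truly urgent
    cases, the fraction the model also flags as urgent."""
    true_positives = sum(p and t for p, t in zip(pred_urgent, true_urgent))
    positives = sum(true_urgent)
    return true_positives / positives if positives else 0.0


# Toy example with made-up values (not data from the paper):
preds = [
    ["psoriasis", "eczema", "tinea corporis"],
    ["melanoma", "dysplastic nevus", "seborrheic keratosis"],
]
labels = ["eczema", "basal cell carcinoma"]
print(top_k_accuracy(preds, labels, k=3))  # 0.5 -- only the first case hits in the top 3

print(triage_sensitivity([True, False, True], [True, True, True]))  # ~0.67
```

In practice, matching a model's free-text differential against a reference diagnosis usually requires synonym normalization or expert adjudication rather than exact string comparison, which is one reason benchmark scores and real-world consultation scores can diverge.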