Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology
arXiv cs.AI / 5/7/2026
Key Points
- The study benchmarks five multimodal LLMs (four open-weight and one commercial) on dermatology-specific tasks and finds that their public benchmark results translate poorly to real-world clinical decision-making.
- Differential diagnosis performance drops sharply in a hospital-based multi-site retrospective cohort of 5,811 cases and 46,405 clinical images, with top-3 diagnostic accuracy reaching only 1.50%–13.35% for open-weight models using images alone.
- Adding clinical context improves accuracy across all models, raising top-3 diagnostic accuracy to as high as 28.75% for open-weight models and 38.93% for GPT-4.1, yet outputs remain highly sensitive to incomplete or incorrect context.
- For severity-based triage, models show moderate sensitivity (over 60%), indicating possible value for preliminary screening but not enough reliability for clinical deployment.
- Overall, the findings suggest that current dermatology multimodal LLMs are not ready for bedside use, and benchmark metrics substantially overestimate real-world capability.
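
For readers unfamiliar with the two metrics quoted above, here is a minimal Python sketch of how top-3 diagnostic accuracy and triage sensitivity are conventionally computed. It is an illustration under stated assumptions, not the study's scoring code: the function names, the exact-string matching of diagnoses, and the toy cases are all hypothetical, and this summary does not describe how the authors matched model outputs to reference diagnoses.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    reference_diagnoses: Sequence[str],
    k: int = 3,
) -> float:
    """Fraction of cases whose reference diagnosis is among the first k predictions.

    Assumption: diagnoses are compared by exact string equality. A real
    evaluation would need label normalization or clinician adjudication.
    """
    if len(ranked_predictions) != len(reference_diagnoses):
        raise ValueError("predictions and references must have the same length")
    if not reference_diagnoses:
        raise ValueError("at least one case is required")
    hits = sum(
        reference in predictions[:k]
        for predictions, reference in zip(ranked_predictions, reference_diagnoses)
    )
    return hits / len(reference_diagnoses)


def sensitivity(predicted_urgent: Sequence[bool], actually_urgent: Sequence[bool]) -> float:
    """True positives divided by all truly urgent cases (TP / (TP + FN))."""
    if len(predicted_urgent) != len(actually_urgent):
        raise ValueError("inputs must have the same length")
    true_pos = sum(p and a for p, a in zip(predicted_urgent, actually_urgent))
    false_neg = sum((not p) and a for p, a in zip(predicted_urgent, actually_urgent))
    if true_pos + false_neg == 0:
        raise ValueError("sensitivity is undefined without any urgent cases")
    return true_pos / (true_pos + false_neg)


# Hypothetical toy data, invented purely for illustration.
ranked = [
    ["psoriasis", "eczema", "tinea corporis"],
    ["melanoma", "nevus", "seborrheic keratosis"],
    ["acne", "rosacea", "folliculitis"],
    ["eczema", "scabies", "urticaria"],
]
references = ["eczema", "basal cell carcinoma", "acne", "psoriasis"]

top3 = top_k_accuracy(ranked, references, k=3)  # 2 of 4 cases hit
top1 = top_k_accuracy(ranked, references, k=1)  # 1 of 4 cases hit

flagged = [True, False, True, False, True]
urgent = [True, True, True, False, False]
triage_sensitivity = sensitivity(flagged, urgent)  # 2 of 3 urgent cases caught
```

Note that sensitivity alone says nothing about false alarms: in the toy triage data one non-urgent case is flagged, which would only show up in specificity or precision.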