SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis

arXiv cs.AI / 3/30/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

Key Points

  • SkinGPT-X is a multimodal collaborative multi-agent dermatological diagnosis system designed to improve interpretability and traceability beyond what monolithic LLMs can provide for fine-grained and rare skin disease cases.
  • The approach introduces a self-evolving dermatological memory mechanism that adapts over time instead of relying on static knowledge bases, aiming to better fit real-world clinical complexity.
  • The paper reports state-of-the-art results versus four leading LLMs across multiple public datasets, including +9.6% accuracy on DDI31 and +13% weighted F1 on Dermnet.
  • To evaluate fine-grained and rare-disease performance, the authors compile a 498-category dataset and a rare-skin-disease benchmark with 564 samples spanning eight rare conditions, where SkinGPT-X improves accuracy by +9.8% and shows gains in weighted F1 and Cohen’s Kappa.
  • A three-tier comparative experimental design is used to assess robustness, positioning SkinGPT-X as a research contribution toward more trustworthy, clinically aligned AI diagnostic reasoning pipelines.

Abstract

While recent advancements in Large Language Models have significantly advanced dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and rare skin disease diagnosis owing to training data sparsity, while also lacking the interpretability and traceability essential for clinical reasoning. Although multi-agent systems can offer more transparent and explainable diagnostics, existing frameworks are primarily concentrated on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs across four public datasets, demonstrating its state-of-the-art performance with a +9.6% accuracy improvement on DDI31 and +13% weighted F1 gain on Dermnet over the state-of-the-art model. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities. Finally, we curate the rare skin disease dataset, the first benchmark to address the scarcity of clinical rare skin diseases which contains 564 clinical samples with eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, a +10% Cohen's Kappa improvement.