SkinCLIP-VL: Consistency-Aware Vision-Language Learning for Multimodal Skin Cancer Diagnosis

arXiv cs.CV · March 24, 2026

Key Points

  • SkinCLIP-VL is a resource-efficient vision-language learning framework for multimodal skin cancer diagnosis under data scarcity and tight compute budgets.
  • The method freezes a CLIP encoder and adapts a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA), cutting memory use and trainable parameters while maintaining performance (see the sketch after this list).
  • It introduces a Consistency-aware Focal Alignment (CFA) loss to align visual regions with clinical semantics more reliably, especially under long-tailed data distributions.
  • On ISIC and Derm7pt benchmarks, SkinCLIP-VL improves accuracy over 13B-parameter baselines by 4.3–6.2% while using 43% fewer parameters.
  • Blinded expert evaluation and out-of-distribution testing suggest the model’s visually grounded rationales increase clinical trust compared with traditional saliency-map approaches.

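The frozen-perception / adaptive-reasoning split described in the second bullet maps naturally onto standard tooling. Below is a minimal sketch using Hugging Face transformers, peft, and bitsandbytes; the checkpoint names, 4-bit NF4 quantization, LoRA rank, and target modules are illustrative assumptions, not the paper's published configuration, and the Qwen2.5-VL class requires a recent transformers release.

```python
# Hedged sketch of the frozen-CLIP + quantized-Qwen2.5-VL + LoRA recipe.
# All hyperparameters and checkpoint ids below are assumptions for illustration.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    BitsAndBytesConfig,
    CLIPVisionModel,
    Qwen2_5_VLForConditionalGeneration,  # needs a recent transformers version
)

# Frozen perception: the CLIP vision encoder is never updated during training.
clip = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
clip.requires_grad_(False).eval()

# Adaptive reasoning: load the VLM 4-bit quantized (assumed NF4) to cut memory.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
vlm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",  # assumed model size
    quantization_config=bnb,
    device_map="auto",
)
vlm = prepare_model_for_kbit_training(vlm)

# Low-rank adaptation: only small LoRA matrices on attention projections train.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
vlm = get_peft_model(vlm, lora)
vlm.print_trainable_parameters()  # only a small fraction of weights is trainable
```
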
Abstract

The deployment of vision-language models (VLMs) in dermatology is hindered by a trilemma of high computational cost, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a frozen-perception, adaptive-reasoning paradigm, we integrate a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA). To align visual regions strictly with clinical semantics under long-tailed distributions, we propose the Consistency-aware Focal Alignment (CFA) loss, an objective that combines focal re-weighting, semantic alignment, and calibration. On the ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3–6.2% in accuracy with 43% fewer parameters. Crucially, blinded expert evaluation and out-of-distribution testing confirm that our visually grounded rationales significantly enhance clinical trust compared with traditional saliency maps.
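
The abstract names the three ingredients of the CFA loss but not its exact form. The sketch below is one plausible instantiation, assuming a focal-reweighted cross-entropy term, a cosine-alignment term against class text embeddings, and a negative-entropy penalty for calibration; the function name `cfa_loss` and the coefficients `gamma`, `lambda_align`, and `lambda_cal` are likewise hypothetical.

```python
# Hedged sketch of a possible CFA objective; the exact formulation is assumed.
import torch
import torch.nn.functional as F

def cfa_loss(logits, image_emb, text_emb, targets,
             gamma=2.0, lambda_align=0.5, lambda_cal=0.1):
    """logits: (B, C); image_emb: (B, D); text_emb: (C, D) class-text
    embeddings; targets: (B,) integer labels. All coefficients are assumed."""
    log_p = F.log_softmax(logits, dim=-1)

    # 1) Focal re-weighting: down-weight easy, majority-class examples so the
    #    long tail of rare lesion types contributes more gradient.
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    focal = -((1.0 - log_pt.exp()) ** gamma) * log_pt

    # 2) Semantic alignment: pull each image embedding toward the text
    #    embedding of its ground-truth class (cosine distance).
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    align = 1.0 - (img * txt[targets]).sum(dim=-1)

    # 3) Calibration: a negative-entropy penalty that discourages
    #    over-confident predictions.
    neg_entropy = (log_p.exp() * log_p).sum(dim=-1)

    return (focal + lambda_align * align + lambda_cal * neg_entropy).mean()

# Toy usage: 4 samples, 7 diagnostic classes, 512-d joint embedding space.
loss = cfa_loss(torch.randn(4, 7), torch.randn(4, 512),
                torch.randn(7, 512), torch.tensor([0, 3, 6, 6]))
```

Under this reading, focal re-weighting counters the long-tailed class distribution, the alignment term keeps visual features consistent with clinical text semantics, and the entropy penalty discourages the over-confident predictions that miscalibrate rare classes.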