When AI and Experts Agree on Error: Intrinsic Ambiguity in Dermatoscopic Images

arXiv cs.CV / 4/2/2026


Key Points

  • The study proposes that AI misclassifications in dermatoscopic diagnosis may reflect intrinsic visual ambiguity in the images rather than solely model bias.
  • Across multiple CNN architectures, the authors isolate a subset of images that are systematically misclassified by all models, showing this error pattern occurs significantly more often than chance would predict.
  • Expert dermatologists exhibit a major performance collapse on these AI-misclassified “difficult” images, with agreement with ground truth dropping sharply (Cohen’s kappa 0.08 vs. 0.61 for controls) and inter-rater reliability weakening (Fleiss kappa 0.275 vs. 0.456).
  • The research identifies image quality as a key factor driving both model and human failure modes, suggesting data/quality limitations can undermine both automated and expert diagnosis.
  • To support transparency and reproducibility, the authors publicly release the data, code, and trained models alongside the arXiv submission.
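The claim in the second point above can be illustrated with a simple significance test. The sketch below, which assumes independent per-model errors and uses illustrative numbers rather than the paper's actual figures, checks whether the count of images misclassified by every model exceeds what chance overlap would predict:

```python
# Hypothetical sketch: is the set of images misclassified by ALL models
# larger than independent, per-model error rates would explain?
# All rates and counts below are illustrative, not the study's values.
from math import prod
from scipy.stats import binomtest

n_images = 1000                      # evaluation set size (illustrative)
error_rates = [0.15, 0.18, 0.12]     # per-model error rates (illustrative)
n_all_wrong = 35                     # images misclassified by every model

# Under independence, an image is missed by all models with probability
# equal to the product of the individual error rates.
p_chance = prod(error_rates)         # 0.15 * 0.18 * 0.12 = 0.00324

# One-sided binomial test: is the observed overlap larger than chance?
result = binomtest(n_all_wrong, n_images, p_chance, alternative="greater")
print(f"expected by chance: {n_images * p_chance:.1f}, observed: {n_all_wrong}")
print(f"p-value: {result.pvalue:.3g}")
```

A very small p-value here would indicate, as the authors argue, that the shared failures cluster on specific images rather than arising from independent model errors.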

Abstract

The integration of artificial intelligence (AI), particularly Convolutional Neural Networks (CNNs), into dermatological diagnosis demonstrates substantial clinical potential. While existing literature predominantly benchmarks algorithmic performance against human experts, our study adopts a novel perspective by investigating the intrinsic complexity of dermatoscopic images. Through rigorous experimentation with multiple CNN architectures, we isolated a subset of images systematically misclassified across all models, a pattern statistically shown to exceed random chance. To determine whether these failures stem from algorithmic biases or inherent visual ambiguity, expert dermatologists independently evaluated these challenging cases alongside a control group. The results revealed a collapse in human diagnostic performance on the AI-misclassified images. First, agreement with ground-truth labels plummeted: Cohen's kappa dropped to a mere 0.08 for the difficult images, compared to 0.61 for the control group. Second, we observed a severe deterioration in expert consensus; inter-rater reliability among physicians fell from moderate concordance (Fleiss kappa = 0.456) on control images to only fair agreement (Fleiss kappa = 0.275) on difficult cases. We identified image quality as a primary driver of these dual systematic failures. To promote transparency and reproducibility, all data, code, and trained models have been made publicly available.
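The two agreement statistics reported in the abstract can be computed with standard libraries. The sketch below uses toy labels (not the study's data): Cohen's kappa compares a single rater against ground truth, while Fleiss' kappa measures consensus among several raters.

```python
# Toy illustration of Cohen's and Fleiss' kappa; labels are synthetic.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
ground_truth = rng.integers(0, 3, size=50)           # 3 diagnostic classes

def noisy_rater(truth, accuracy=0.7):
    """Simulate a rater who matches the truth with the given probability."""
    random_labels = rng.integers(0, 3, size=truth.size)
    return np.where(rng.random(truth.size) < accuracy, truth, random_labels)

# Cohen's kappa: one rater's labels vs. ground truth
rater = noisy_rater(ground_truth)
print(f"Cohen's kappa:  {cohen_kappa_score(ground_truth, rater):.2f}")

# Fleiss' kappa: agreement among three independent raters
raters = np.stack([noisy_rater(ground_truth) for _ in range(3)], axis=1)
table, _ = aggregate_raters(raters)                  # per-category counts
print(f"Fleiss' kappa: {fleiss_kappa(table):.2f}")
```

Both statistics are chance-corrected, which is why the reported drop from 0.61 to 0.08 (Cohen) and 0.456 to 0.275 (Fleiss) signals near-random performance on the difficult images rather than merely lower raw accuracy.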