Concept Inconsistency in Dermoscopic Concept Bottleneck Models: A Rough-Set Analysis of the Derm7pt Dataset

arXiv cs.LG / 4/22/2026


Key Points

  • Concept Bottleneck Models (CBMs) can be fundamentally limited by concept-level inconsistencies in a dataset: when identical concept profiles map to conflicting diagnosis labels, no predictor that sees only the concepts can resolve the conflict, which caps achievable accuracy.
  • Using rough-set analysis on the Derm7pt dermoscopy benchmark, the study finds that 50 of 305 unique concept profiles (16.4%) are inconsistent, affecting 306 images (30.3% of the dataset) and implying a theoretical accuracy ceiling of 92.1% for CBMs that rely on hard concept labels.
  • The paper analyzes how conflict severity is distributed and which clinical features most contribute to boundary ambiguity, then compares two filtering approaches that change dataset composition and CBM interpretability.
  • Symmetric filtering, which removes every boundary-region image, yields Derm7pt+ (705 images), a fully consistent subset with perfect quality of classification and no hard accuracy ceiling. The authors then evaluate a hard CBM across 19 backbone architectures to provide reproducible benchmarks.
  • EfficientNet variants perform best under both filtering schemes: EfficientNet-B5 under symmetric filtering (label F1 0.85, label accuracy 0.90, concept accuracy 0.70) and EfficientNet-B7 under asymmetric filtering (label F1 0.82, concept accuracy 0.70). These results serve as baselines for concept-consistent CBM evaluation in dermoscopic settings.
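The inconsistency counts and the accuracy ceiling described in the points above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy rows, and the label strings are hypothetical, and it assumes the ceiling is obtained by predicting the majority diagnosis for every concept profile, which is the best any deterministic function of hard concepts can do on the data it is measured on.

```python
from collections import Counter, defaultdict


def profile_label_counts(rows):
    """Group images by concept profile and count diagnosis labels per profile.

    `rows` is an iterable of (concepts, label) pairs, where `concepts` is a
    sequence of hard concept values (hypothetical encoding for illustration).
    """
    groups = defaultdict(Counter)
    for concepts, label in rows:
        groups[tuple(concepts)][label] += 1
    return groups


def inconsistency_report(rows):
    """Summarize concept-level inconsistency in rough-set terms."""
    groups = profile_label_counts(rows)
    n_images = sum(sum(counts.values()) for counts in groups.values())

    # A profile is inconsistent when it carries more than one diagnosis label.
    inconsistent = {p: c for p, c in groups.items() if len(c) > 1}
    n_boundary = sum(sum(c.values()) for c in inconsistent.values())

    # Assumed ceiling: predict the majority label of each profile.
    best_correct = sum(max(c.values()) for c in groups.values())

    return {
        "n_profiles": len(groups),
        "n_inconsistent_profiles": len(inconsistent),
        "n_boundary_images": n_boundary,
        "accuracy_ceiling": best_correct / n_images,
    }


# Toy data: three images share one profile but disagree on the diagnosis.
toy_rows = [
    ((0, 1, 0), "nevus"),
    ((0, 1, 0), "nevus"),
    ((0, 1, 0), "melanoma"),
    ((1, 1, 0), "melanoma"),
    ((1, 0, 2), "nevus"),
]
report = inconsistency_report(toy_rows)
print(report)
```

On the toy data, one of three profiles is inconsistent, it covers three of five images, and at most four images can be classified correctly from the concepts alone.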

Abstract

Concept Bottleneck Models (CBMs) route predictions exclusively through a clinically grounded concept layer, binding interpretability to concept-label consistency. When a dataset contains concept-level inconsistencies, identical concept profiles mapped to conflicting diagnosis labels create an unresolvable bottleneck that imposes a hard ceiling on achievable accuracy. In this paper, we apply rough set theory to the Derm7pt dermoscopy benchmark and characterize the full extent and clinical structure of this inconsistency. Among 305 unique concept profiles formed by the 7 dermoscopic criteria of the 7-point melanoma checklist, 50 (16.4%) are inconsistent, spanning 306 images (30.3% of the dataset). For CBMs that operate exclusively on hard concepts, this yields a theoretical accuracy ceiling of 92.1%, independent of backbone architecture or training strategy. In addition, we characterize the conflict-severity distribution, identify the clinical features most responsible for boundary ambiguity, and evaluate two filtering strategies with quantified effects on dataset composition and CBM interpretability. Symmetric removal of all boundary-region images yields Derm7pt+, a fully consistent benchmark subset of 705 images with perfect quality of classification and no hard accuracy ceiling. Building on this filtered dataset, we present a hard CBM evaluated across 19 backbone architectures from the EfficientNet, DenseNet, ResNet, and Wide ResNet families. Under symmetric filtering, explored for completeness, EfficientNet-B5 achieves the best label F1 score (0.85) and label accuracy (0.90) on the held-out test set, with a concept accuracy of 0.70. Under asymmetric filtering, EfficientNet-B7 leads across all four metrics, reaching a label F1 score of 0.82 and concept accuracy of 0.70. These results establish reproducible baselines for concept-consistent CBM evaluation on dermoscopic data.
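Symmetric filtering, as described in the abstract, removes every image whose concept profile falls in the boundary region. The sketch below is one plausible reading of that step, not the authors' implementation: the function names and toy rows are hypothetical, and quality of classification is taken in its standard rough-set sense, the fraction of images whose profile maps to a single label. Asymmetric filtering is not specified in this summary, so it is not sketched. As a sanity check on the reported numbers, 306 boundary images plus the 705 retained images give 1011 images in total, and 306 / 1011 is about 30.3%.

```python
from collections import defaultdict


def symmetric_filter(rows):
    """Keep only images whose concept profile maps to exactly one label.

    `rows` is a list of (concepts, label) pairs (hypothetical encoding).
    Every image sharing an inconsistent profile is dropped, regardless of
    which diagnosis it carries.
    """
    labels_by_profile = defaultdict(set)
    for concepts, label in rows:
        labels_by_profile[tuple(concepts)].add(label)
    return [
        (concepts, label)
        for concepts, label in rows
        if len(labels_by_profile[tuple(concepts)]) == 1
    ]


def quality_of_classification(rows):
    """Rough-set quality of classification: |positive region| / |universe|."""
    if not rows:
        raise ValueError("rows must not be empty")
    return len(symmetric_filter(rows)) / len(rows)


# Toy data: the (0, 1, 0) profile is inconsistent and is removed entirely.
toy_rows = [
    ((0, 1, 0), "nevus"),
    ((0, 1, 0), "nevus"),
    ((0, 1, 0), "melanoma"),
    ((1, 1, 0), "melanoma"),
    ((1, 0, 2), "nevus"),
]
filtered_rows = symmetric_filter(toy_rows)
gamma_before = quality_of_classification(toy_rows)
gamma_after = quality_of_classification(filtered_rows)
print(len(filtered_rows), gamma_before, gamma_after)
```

After filtering, every remaining profile is consistent, so the quality of classification is 1 by construction, which matches the "perfect quality of classification" claimed for Derm7pt+.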