From Pixels to Explanations: Interpretable Diabetic Retinopathy Grading with CNN-Transformer Ensembles, Visual Explainability and Vision-Language Models

arXiv cs.CV / 4/28/2026


Key Points

  • The paper proposes an interpretable diabetic retinopathy (DR) grading approach that transforms retinal pixels into clinically meaningful, multimodal explanations.
  • It benchmarks six CNN- and Transformer-based backbones on the APTOS 2019 dataset under a controlled protocol with stratified five-fold cross-validation, identifying ResNet-50 and ConvNeXt-Tiny as the strongest single-model baselines.
  • Multiple ensembling methods are evaluated (hard voting, weighted soft voting, stacking, and a hybrid class-level fusion), with weighted soft voting delivering the most consistent performance improvements across folds (a sketch of this fusion step follows the list).
  • For interpretability, the study pairs Grad-CAM++ visual attribution maps with short textual rationales generated by vision-language models (VLMs), observing generally grade-consistent explanations but coarse localization from Grad-CAM++.
  • The VLM component shows a quantitative trade-off between clinical completeness and template-level semantic similarity, while image-text alignment (e.g., CLIPScore) remains broadly comparable across variants.
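
For concreteness, here is a minimal sketch of the weighted soft-voting fusion step. The function name, the shape conventions, and the use of validation scores as weights are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def weighted_soft_vote(probs_per_model, weights):
    """Fuse per-model class probabilities by a weighted average.

    probs_per_model: list of (n_samples, n_classes) softmax arrays, one per backbone.
    weights: one weight per model (e.g. its validation QWK), normalized below.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalize weights to sum to 1
    stacked = np.stack(probs_per_model, axis=0)  # (n_models, n_samples, n_classes)
    fused = np.einsum("m,msc->sc", w, stacked)   # weighted average over models
    return fused.argmax(axis=1)                  # predicted DR grade per image
```

Hard voting, by contrast, would take a majority over each model's argmax grades, discarding the confidence information that the weighted average preserves.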

Abstract

The quality of diabetic retinopathy (DR) screening relies on the ability to correctly grade severity; however, many deep-learning (DL) classifiers cannot be easily interpreted in the clinical context. This study presents a methodology that combines strong discriminative models with multimodal explanations, converting retinal pixels into clinically interpretable outputs. Using the APTOS 2019 benchmark, we evaluated six representative CNN- and transformer-based backbones under a controlled protocol with stratified five-fold cross-validation. We then compared ensembling strategies (hard voting, weighted soft voting, stacking) and investigated a hybrid class-level fusion variant to exploit grade-specific advantages. For interpretability, we produced Grad-CAM++ visual attribution maps and generated short textual rationales with vision-language models (VLMs) conditioned on the fundus image and classifier outputs under conservative prompting constraints. Modern CNN backbones (ResNet-50 and ConvNeXt-Tiny) provided the strongest single-model baselines, with cross-validated QWK up to 0.919 and 0.914, respectively. Ensembling improved ordinal agreement, and weighted soft voting was the most consistent across folds (QWK 0.934 ± 0.017). Hybrid class-level fusion was competitive but did not yield a statistically reliable improvement over standard fusion in paired fold comparisons (Holm-adjusted p ≥ 1.000). For explanation quality, Grad-CAM++ offered plausible but coarse localization, and VLM rationales were generally grade-consistent. Quantitatively, VLM variants showed a trade-off between clinical completeness and template-level semantic similarity (coverage 0.700 vs. BERTScore 0.072), while image-text alignment was comparable (CLIPScore ≈ 0.34).
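
As background on the headline numbers, quadratic weighted kappa (QWK) measures ordinal agreement, penalizing a misgrade by the squared distance between the predicted and true DR grades (0 = no DR through 4 = proliferative DR). A minimal sketch of the stratified five-fold protocol, assuming a hypothetical train_and_predict callable that fits a backbone on the training fold and returns grades for the held-out fold:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold

def cross_validated_qwk(images, labels, train_and_predict, n_splits=5, seed=42):
    """Mean and std of quadratic weighted kappa over stratified folds.

    labels: np.ndarray of integer DR grades, one per image.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(images, labels):
        y_pred = train_and_predict(train_idx, test_idx)  # hypothetical trainer
        scores.append(cohen_kappa_score(labels[test_idx], y_pred,
                                        weights="quadratic"))
    return float(np.mean(scores)), float(np.std(scores))
```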
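Similarly, Grad-CAM++ attribution maps of the kind described above can be produced with the open-source pytorch-grad-cam package. The following sketch assumes a fine-tuned ResNet-50 grader; the model weights, input tensor, and target grade index are placeholders:

```python
import torch
from torchvision.models import resnet50
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = resnet50(weights=None)        # placeholder: load fine-tuned DR weights here
model.eval()
target_layers = [model.layer4[-1]]    # last conv block of ResNet-50

cam = GradCAMPlusPlus(model=model, target_layers=target_layers)
fundus = torch.randn(1, 3, 224, 224)  # placeholder for a preprocessed fundus image
heatmaps = cam(input_tensor=fundus,
               targets=[ClassifierOutputTarget(2)])  # e.g. grade 2 (moderate DR)
heatmap = heatmaps[0]                 # (H, W) attribution map in [0, 1]
```

As the abstract notes, such maps tend to be plausible but coarse: they highlight broad retinal regions rather than individual lesions.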