From Pixels to Explanations: Interpretable Diabetic Retinopathy Grading with CNN-Transformer Ensembles, Visual Explainability and Vision-Language Models
arXiv cs.CV / 4/28/2026
Key Points
- The paper proposes an interpretable diabetic retinopathy (DR) grading approach that transforms retinal pixels into clinically meaningful, multimodal explanations.
- It benchmarks six CNN and Transformer backbones on the APTOS 2019 dataset under a controlled protocol with stratified five-fold cross-validation, identifying ResNet-50 and ConvNeXt-Tiny as the strongest single-model baselines.
- Multiple ensembling methods are evaluated (hard voting, weighted soft voting, stacking, and a hybrid class-level fusion), with weighted soft voting delivering the most consistent performance improvements across folds.
- For interpretability, the study combines Grad-CAM++ visual attribution maps with short textual rationales generated by a vision-language model (VLM), observing generally grade-consistent explanations but coarse localization from Grad-CAM++.
- The VLM component shows a quantitative trade-off between clinical completeness and template-level semantic similarity, while image-text alignment (e.g., CLIPScore) remains broadly comparable across variants.
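The stratified five-fold protocol mentioned above is easy to picture in code. The sketch below is illustrative, not the paper's implementation: the `stratified_kfold` helper and the toy labels are assumptions, but the idea is the standard one, each fold preserves the per-grade class proportions of the DR labels (0–4 on the APTOS scale).

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs whose validation folds
    preserve the per-class proportions of `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_splits)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class's indices round-robin across the folds.
        for j, i in enumerate(idxs):
            folds[j % n_splits].append(i)
    for k in range(n_splits):
        val = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, val

# Toy imbalanced label set: 10 samples of grade 0, 5 of grade 1.
labels = [0] * 10 + [1] * 5
for train_idx, val_idx in stratified_kfold(labels):
    counts = [sum(1 for i in val_idx if labels[i] == g) for g in (0, 1)]
    print(f"val fold class counts: {counts}")  # 2 of grade 0, 1 of grade 1
```

Stratification matters here because DR datasets are heavily skewed toward grade 0; a plain random split could leave a fold with almost no severe cases.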
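Of the ensembling methods compared, weighted soft voting is the simplest to sketch: average each model's class-probability outputs with per-model weights, then take the argmax. The function and example numbers below are hypothetical, assuming each model emits an `(N, C)` probability array:

```python
import numpy as np

def weighted_soft_vote(probs, weights):
    """Fuse per-model class probabilities by a weighted average.

    probs   : list of (N, C) arrays, one per model
    weights : per-model weights (normalized internally)
    returns : (N,) array of predicted class indices
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                            # normalize so weights sum to 1
    stacked = np.stack(probs)                  # (M, N, C)
    fused = np.tensordot(w, stacked, axes=1)   # (N, C) weighted average
    return fused.argmax(axis=1)

# Two toy models, two samples, two classes; model A trusted 3x more.
model_a = np.array([[0.6, 0.4], [0.2, 0.8]])
model_b = np.array([[0.3, 0.7], [0.4, 0.6]])
print(weighted_soft_vote([model_a, model_b], weights=[3, 1]))  # → [0 1]
```

Hard voting would discard the probability mass and count only per-model argmaxes, which is why soft variants tend to be more stable when the base models are well calibrated.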
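The visual attribution side can be sketched in NumPy. For brevity this shows plain Grad-CAM, Grad-CAM++ refines the channel weights with higher-order gradient terms, but the pipeline is the same: weight a conv layer's activation maps by gradient-derived channel importances, ReLU, and normalize. The arrays here are stand-ins for real activations and gradients:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Minimal Grad-CAM heatmap.

    activations : (C, H, W) feature maps from one conv layer
    gradients   : (C, H, W) gradients of the target class score
                  w.r.t. those feature maps
    returns     : (H, W) heatmap normalized to [0, 1]
    """
    weights = gradients.mean(axis=(1, 2))             # (C,) per-channel importance
    cam = np.tensordot(weights, activations, axes=1)  # (H, W) weighted sum of maps
    cam = np.maximum(cam, 0)                          # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # scale to [0, 1]
    return cam

# Toy example: one channel, activation concentrated at the top-left pixel.
acts = np.zeros((1, 2, 2)); acts[0, 0, 0] = 2.0
grads = np.ones((1, 2, 2))
print(grad_cam(acts, grads))  # hotspot at (0, 0), zeros elsewhere
```

The coarse localization the paper reports is inherent to this family of methods: the heatmap has the spatial resolution of the last conv layer and must be upsampled to image size, so small lesions like microaneurysms blur into broad regions.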