Multimodal Deep Learning for Diabetic Foot Ulcer Staging Using Integrated RGB and Thermal Imaging

arXiv cs.CV / 3/31/2026


Key Points

  • The study proposes multimodal deep learning for diabetic foot ulcer (DFU) staging by integrating simultaneously captured RGB and thermal images to improve early diagnosis and monitoring.
  • A Raspberry Pi-based portable imaging system was built to collect hospital data, resulting in a labeled dataset of 1,205 expert-annotated samples across six DFU stages.
  • Models were trained on three variants (RGB-only, thermal-only, and RGB+Thermal as a 4-channel input) using DenseNet121, EfficientNetV2, InceptionV3, ResNet50, and VGG16.
  • Results indicate the combined RGB+Thermal approach outperforms single-modality training, with the best performance coming from VGG16 using RGB+Thermal (accuracy 93.25%, F1 92.53%, MCC 91.03%).
  • Grad-CAM visualizations suggest the thermal channel helps localize ulcer-related temperature anomalies while the RGB channel provides complementary structural and texture cues.
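The 4-channel fusion in the points above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name, normalization choices, and array shapes are assumptions; the paper only specifies that the thermal image is appended as a fourth channel to the RGB input.

```python
import numpy as np

def fuse_rgb_thermal(rgb, thermal):
    """Fuse an aligned RGB frame and a single-channel thermal frame
    into one 4-channel input array (illustrative sketch).

    rgb:     (H, W, 3) uint8 image
    thermal: (H, W) raw thermal readings (aligned to the RGB frame)
    returns: (H, W, 4) float32 array, each channel scaled to [0, 1]
    """
    rgb = rgb.astype(np.float32) / 255.0
    t = thermal.astype(np.float32)
    # Min-max normalize the thermal channel so its scale matches RGB.
    t = (t - t.min()) / (t.max() - t.min() + 1e-8)
    return np.concatenate([rgb, t[..., None]], axis=-1)

# Example: a 224x224 frame pair, the usual input size for VGG16-style models.
x = fuse_rgb_thermal(np.zeros((224, 224, 3), dtype=np.uint8),
                     np.random.rand(224, 224))
```

In practice the first convolution of a pretrained backbone (e.g. VGG16's 3-channel input layer) must also be widened to accept 4 channels; a common trick is to copy the pretrained RGB filters and initialize the fourth-channel weights from their mean, though the paper does not state which scheme was used.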

Abstract

Diabetic foot ulcers (DFU) are among the serious complications of diabetes and can lead to amputations and high healthcare costs. Regular monitoring and early diagnosis are critical for reducing the clinical burden and the risk of amputation. The aim of this study is to investigate the impact of using multimodal images on deep learning models for the classification of DFU stages. To this end, we developed a Raspberry Pi-based portable imaging system capable of simultaneously capturing RGB and thermal images. Using this prototype, a dataset of 1,205 samples was collected in a hospital setting and labeled by experts into six distinct stages. To evaluate model performance, we prepared three different training sets: RGB-only, thermal-only, and RGB+Thermal (with the thermal image added as a fourth channel). We trained DenseNet121, EfficientNetV2, InceptionV3, ResNet50, and VGG16 on each of these sets. The results show that the multimodal training dataset, in which RGB and thermal data are combined across four channels, outperforms single-modal approaches. The highest performance was observed in the VGG16 model trained on the RGB+Thermal dataset, which achieved an accuracy of 93.25%, an F1-score of 92.53%, and an MCC of 91.03%. Grad-CAM heatmap visualizations demonstrated that the thermal channel helped the model focus on the correct location by highlighting temperature anomalies in the ulcer region, while the RGB channel supported the decision-making process with complementary structural and textural information.
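The abstract reports MCC alongside accuracy and F1. Since this is a six-class staging task, the relevant form is the multiclass generalization of the Matthews correlation coefficient (Gorodkin's formula), computed from the confusion matrix. A minimal sketch follows; the function name is illustrative and not from the paper.

```python
import numpy as np

def multiclass_mcc(cm):
    """Matthews correlation coefficient for K classes from a KxK
    confusion matrix (rows = true class, columns = predicted class)."""
    cm = np.asarray(cm, dtype=float)
    s = cm.sum()        # total number of samples
    c = np.trace(cm)    # correctly classified samples
    t = cm.sum(axis=1)  # per-class true counts
    p = cm.sum(axis=0)  # per-class predicted counts
    num = c * s - t @ p
    den = np.sqrt((s**2 - p @ p) * (s**2 - t @ t))
    return num / den if den else 0.0

# A perfect classifier over three classes yields MCC = 1.0.
print(multiclass_mcc(np.diag([5, 3, 2])))  # → 1.0
```

MCC is a useful complement to accuracy here because a six-stage clinical dataset is unlikely to be perfectly balanced, and MCC penalizes classifiers that overfit to majority stages.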
