Artificial intelligence application in lymphoma diagnosis with Vision Transformer using weakly supervised training

arXiv cs.CV / 4/16/2026


Key Points

  • The study applies a Vision Transformer (ViT) to classify anaplastic large cell lymphoma (ALCL) versus classic Hodgkin lymphoma (cHL) using histology image patches.
  • It builds on earlier fully supervised results (trained on 1,200 patches) that reached 100% accuracy and an F1 score of 1.0 on an independent test set.
  • To make the approach more clinically practical, the authors switch to weakly supervised training by using slide-level labels to automatically label patch-level training data.
  • With a much larger dataset of 100,000 image patches, the weakly supervised ViT achieves evaluation metrics of 91.85% accuracy, F1 = 0.92, and AUC = 0.98.
  • The authors conclude the weakly supervised ViT is suitable as a deep learning module for clinical model development when automated patch extraction is feasible.
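The weak-supervision scheme described above boils down to propagating each slide-level diagnosis to every patch tiled from that slide, so no patch-by-patch expert annotation is needed. A minimal sketch of that labeling step is shown below; the `Patch` structure and the function name are illustrative assumptions, not the authors' actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Patch:
    slide_id: str
    coords: Tuple[int, int]  # top-left corner of the patch in the WSI
    label: str               # inherited from the slide-level diagnosis


def weakly_label_patches(slide_id: str, slide_label: str,
                         patch_coords: List[Tuple[int, int]]) -> List[Patch]:
    """Assign the slide-level diagnosis to every patch from that slide.

    This is the core simplification of weak supervision: instead of a
    pathologist labeling individual patches, each patch inherits the
    label of its whole-slide image. Some patches may therefore carry
    noisy labels (e.g. background or non-tumor tissue), which the model
    must tolerate during training.
    """
    return [Patch(slide_id, xy, slide_label) for xy in patch_coords]


# Example: a cHL slide tiled into three 224x224 patches
patches = weakly_label_patches("slide_007", "cHL", [(0, 0), (0, 224), (224, 0)])
```

Label noise is the trade-off: accuracy drops from the 100% of the fully supervised pilot to 91.85%, but the approach scales to 100,000 patches without any expert patch-level annotation.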

Abstract

Vision transformers (ViTs) allow for more flexible feature detection and can outperform convolutional neural networks (CNNs) when pre-trained on sufficient data. Due to their promising feature detection capabilities, we deployed ViTs for morphological classification of anaplastic large cell lymphoma (ALCL) versus classic Hodgkin lymphoma (cHL). We had previously designed a ViT model trained on a small dataset of 1,200 image patches in a fully supervised manner; that model achieved a diagnostic accuracy of 100% and an F1 score of 1.0 on the independent test set. Because fully supervised training is impractical, requiring scarce expert annotation in both the training and testing phases, we conducted a follow-up study with a modified approach to training data (weakly supervised training) and show that automatically labeling training image patches at the slide level of each whole-slide image is a more practical solution for clinical use of Vision Transformers. Our ViT model, trained on a larger dataset of 100,000 image patches, achieves an accuracy of 91.85%, an F1 score of 0.92, and an area under the curve (AUC) of 0.98. These results qualify this weakly supervised ViT model as a suitable deep learning module for clinical model development using automated image patch extraction.
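For a binary classification task like ALCL vs. cHL, the reported accuracy and F1 metrics have simple closed forms. The sketch below computes them in plain Python for a toy prediction list (AUC additionally requires continuous scores rather than hard labels, so it is omitted); the function names and the `"ALCL"` positive-class choice are illustrative assumptions.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of patches whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true: List[str], y_pred: List[str], positive: str = "ALCL") -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# Toy example: 5 patches, one ALCL patch misclassified as cHL
truth = ["ALCL", "ALCL", "cHL", "cHL", "ALCL"]
preds = ["ALCL", "cHL",  "cHL", "cHL", "ALCL"]
acc = accuracy(truth, preds)  # 4 of 5 correct -> 0.8
f1_val = f1(truth, preds)     # precision 1.0, recall 2/3 -> 0.8
```

On the paper's held-out patches these formulas yield the reported 91.85% accuracy and 0.92 F1; the 0.98 AUC indicates the model's class scores rank ALCL patches above cHL patches almost everywhere along the decision threshold.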