
RARE disease detection from Capsule Endoscopic Videos based on Vision Transformers

arXiv cs.CV / 3/20/2026


Key Points

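  • The study fine-tunes a Vision Transformer (ViT) based network for multi-label classification on capsule endoscopic videos, using Google's ViT at 224x224 input resolution; the abstract's "batch16" most likely denotes the patch16 variant (16x16-pixel patches) rather than a batch size of 16.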
  • It defines 17 labels, covering anatomical regions (mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve) and findings (active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, ulcer), and tests on three videos from Gastro Competition.
  • On the test set of three videos, the reported overall mAP@0.5 is 0.0205 and the overall mAP@0.95 is 0.0196, indicating very limited performance on this task so far.
  • The work demonstrates the feasibility of applying transformers to capsule endoscopic video analysis but underscores the need for better datasets and architectures to improve rare-disease detection in medical imaging.
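
To make the multi-label setup concrete, here is a minimal sketch, assuming each frame can carry several of the 17 labels at once (for example one anatomical region plus one or more findings). The label names come from the abstract; the helper names `encode_labels` and `decode_logits`, the independent per-label sigmoid, and the 0.5 decision threshold are illustrative assumptions and are not taken from the paper.

```python
import math

# The 17 labels listed in the abstract: 8 anatomical regions, 9 findings.
ANATOMY = ["mouth", "esophagus", "stomach", "small intestine", "colon",
           "z-line", "pylorus", "ileocecal valve"]
FINDINGS = ["active bleeding", "angiectasia", "blood", "erosion", "erythema",
            "hematin", "lymphangioectasis", "polyp", "ulcer"]
LABELS = ANATOMY + FINDINGS
INDEX = {name: i for i, name in enumerate(LABELS)}


def encode_labels(names):
    """Hypothetical helper: turn label names into a 17-dim multi-hot target."""
    target = [0.0] * len(LABELS)
    for name in names:
        target[INDEX[name]] = 1.0
    return target


def decode_logits(logits, threshold=0.5):
    """Hypothetical helper: apply an independent sigmoid to each logit and
    keep every label whose probability reaches the (assumed) threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [LABELS[i] for i, p in enumerate(probs) if p >= threshold]


# A frame in the small intestine that also shows an angiectasia.
target = encode_labels(["small intestine", "angiectasia"])

# Made-up logits: strongly negative everywhere except "stomach".
logits = [-5.0] * len(LABELS)
logits[INDEX["stomach"]] = 3.0
predicted = decode_logits(logits)
```

Unlike a softmax over classes, the per-label sigmoid lets any number of labels be active for the same frame, which is what distinguishes multi-label from single-label classification.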

Abstract

This work addresses the Gastro Competition task of multi-label classification from capsule endoscopic videos (CEV). A deep learning network based on Transformers is fine-tuned for this task. The base model is Google Vision Transformer (ViT) batch16 at 224 x 224 resolution. In total, 17 labels are classified: mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve, active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, and ulcer. For the test dataset of three videos, the overall mAP@0.5 is 0.0205, whereas the overall mAP@0.95 is 0.0196.
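
The abstract does not define what the 0.5 and 0.95 thresholds measure. One common reading for video tasks is temporal IoU between predicted and ground-truth frame segments, and the sketch below assumes that reading. It uses a non-interpolated average precision per label and an unweighted mean over labels; the function names, the `(score, (start, end))` data layout, and the example segments are hypothetical, and the competition's official scoring may differ.

```python
def temporal_iou(pred, truth):
    """IoU of two inclusive frame intervals given as (start, end)."""
    inter_start = max(pred[0], truth[0])
    inter_end = min(pred[1], truth[1])
    inter = max(0, inter_end - inter_start + 1)
    union = (pred[1] - pred[0] + 1) + (truth[1] - truth[0] + 1) - inter
    return inter / union


def average_precision(preds, truths, iou_threshold):
    """Non-interpolated AP for one label.

    preds:  list of (score, (start, end)) predicted segments
    truths: list of (start, end) ground-truth segments
    Each ground-truth segment can be matched at most once.
    """
    if not truths:
        return 0.0
    matched = set()
    hits = 0
    precision_sum = 0.0
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    for rank, (_, segment) in enumerate(ranked, start=1):
        # Pick the best still-unmatched ground-truth segment.
        best_iou, best_idx = 0.0, None
        for idx, truth in enumerate(truths):
            if idx in matched:
                continue
            iou = temporal_iou(segment, truth)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            hits += 1
            precision_sum += hits / rank  # precision at this true positive
    return precision_sum / len(truths)


def mean_average_precision(per_label, iou_threshold):
    """per_label maps label -> (preds, truths); returns the unweighted mean AP."""
    aps = [average_precision(p, t, iou_threshold) for p, t in per_label.values()]
    return sum(aps) / len(aps)


# Made-up example: one polyp segment found with imperfect boundaries,
# one ulcer segment missed entirely.
per_label = {
    "polyp": ([(0.9, (100, 180))], [(100, 199)]),
    "ulcer": ([], [(0, 9)]),
}
map_50 = mean_average_precision(per_label, 0.5)
map_95 = mean_average_precision(per_label, 0.95)
```

In this toy case the polyp prediction overlaps its ground truth with an IoU of 0.81, so it counts as a hit at the 0.5 threshold but not at 0.95. Under this reading, a stricter threshold can only keep the score equal or lower it.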