RARE disease detection from Capsule Endoscopic Videos based on Vision Transformers
arXiv cs.CV / 3/20/2026
Key Points
- The study fine-tunes a Vision Transformer (ViT)-based network for multi-label classification on capsule endoscopy videos, using a batch size of 16 and 224x224 input patches.
- It defines 17 labels: eight anatomical regions (mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve) and nine findings (active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, ulcer). The model is tested on three videos from the Gastro Competition.
- On the test set of three videos, the reported mean average precision is 0.0205 at IoU 0.5 and 0.0196 at IoU 0.95, indicating very limited performance for this task so far.
- The work demonstrates the feasibility of applying transformers to capsule endoscopic video analysis but underscores the need for better datasets and architectures to improve rare-disease detection in medical imaging.
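
To make the multi-label setup in the key points more concrete, here is a minimal sketch in plain Python. It is not the authors' code. The label order, the helper names (`encode`, `predict`, `average_precision`), and the 0.5 decision threshold are all assumptions made for illustration. The average precision shown is a simple frame-level version for a single label; the IoU-based mAP reported above is computed by the competition's own protocol, which this summary does not describe.

```python
import math

# The 17 labels from the key points: 8 anatomical regions, then 9 findings.
# The ordering is an assumption made for this sketch.
LABELS = [
    "mouth", "esophagus", "stomach", "small intestine", "colon",
    "z-line", "pylorus", "ileocecal valve",
    "active bleeding", "angiectasia", "blood", "erosion", "erythema",
    "hematin", "lymphangioectasis", "polyp", "ulcer",
]


def encode(active):
    """Build a multi-hot target: 1.0 for every label present in a frame."""
    active = set(active)
    return [1.0 if name in active else 0.0 for name in LABELS]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(logits, threshold=0.5):
    """Decide each label independently; a frame may carry several labels.

    The threshold of 0.5 is an assumed value, not taken from the paper.
    """
    return [name for name, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


def average_precision(scores, targets):
    """Frame-level AP for one label: mean of precision at each true positive."""
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, target) in enumerate(ranked, start=1):
        if target:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Unlike a softmax classifier, each of the 17 outputs is thresholded on its own, so a single frame can be tagged with both a region such as "stomach" and a finding such as "polyp". Averaging `average_precision` over all labels would give a frame-level mean average precision under these assumptions.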