RARE disease detection from Capsule Endoscopic Videos based on Vision Transformers
arXiv cs.CV / 3/20/2026
Key Points
- The study applies a Vision Transformer (ViT) based network, fine-tuned for multi-label classification on capsule endoscopic videos, using batch size 16 and 224x224 input patches.
- It defines 17 labels, covering anatomical regions (mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve) and findings (active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, ulcer), and is tested on three videos from the Gastro Competition.
- On the test set of three videos, the reported mean average precision is 0.0205 at IoU 0.5 and 0.0196 at IoU 0.95, indicating very limited performance for this task so far.
- The work demonstrates the feasibility of applying transformers to capsule endoscopic video analysis but underscores the need for better datasets and architectures to improve rare-disease detection in medical imaging.
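
To make the multi-label setup and the IoU thresholds concrete, here is a minimal Python sketch. It is not the authors' code. The label names come from the list above, but the helper names `decode_multilabel` and `temporal_iou`, the 0.5 decision threshold, and the reading of IoU as overlap between inclusive frame intervals are all assumptions made for illustration; the summary does not say how predictions are thresholded or how IoU is computed.

```python
import math

# The 17 labels named in the key points: 8 anatomical regions, then 9 findings.
LABELS = [
    "mouth", "esophagus", "stomach", "small intestine", "colon",
    "z-line", "pylorus", "ileocecal valve",
    "active bleeding", "angiectasia", "blood", "erosion", "erythema",
    "hematin", "lymphangioectasis", "polyp", "ulcer",
]


def decode_multilabel(logits, threshold=0.5):
    """Turn one frame's 17 logits into a list of predicted label names.

    Multi-label classification scores each label with an independent
    sigmoid, so a frame can carry several labels at once (for example
    a region and a finding). The 0.5 threshold is an assumption.
    """
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [name for name, p in zip(LABELS, probs) if p >= threshold]


def temporal_iou(pred, truth):
    """IoU of two inclusive frame intervals given as (start, end).

    Assumption: a detection is a contiguous run of frames with one
    label, and IoU is measured along the time axis.
    """
    overlap = min(pred[1], truth[1]) - max(pred[0], truth[0]) + 1
    overlap = max(overlap, 0)  # disjoint intervals have no overlap
    union = (pred[1] - pred[0] + 1) + (truth[1] - truth[0] + 1) - overlap
    return overlap / union
```

Under these assumptions, a predicted segment covering frames 0 to 99 against a true segment covering frames 50 to 149 has an IoU of 1/3, so it would count as a miss at both the 0.5 and the 0.95 threshold.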