RARE disease detection from Capsule Endoscopic Videos based on Vision Transformers

arXiv cs.CV / 3/20/2026

Key Points

The study fine-tunes a Vision Transformer (ViT) network for multi-label classification of capsule endoscopic videos, starting from a Google ViT model that uses 16x16 patches at 224x224 input resolution.
It defines 17 labels: eight anatomical regions (mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve) and nine findings (active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, ulcer). The model is tested on three videos from the Gastro Competition.
On the test set of three videos, the reported overall mean average precision (mAP) is 0.0205 at IoU 0.5 and 0.0196 at IoU 0.95, which indicates very limited performance on this task so far.
The work demonstrates the feasibility of applying transformers to capsule endoscopic video analysis but underscores the need for better datasets and architectures to improve rare-disease detection in medical imaging.
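To make the multi-label setup in the points above concrete, here is a minimal sketch of how the 17 labels could be encoded as a multi-hot target and decoded from per-label logits. It is not the authors' code: the label order, the helper names (`encode`, `decode`, `bce_loss`), and the 0.5 decision threshold are all assumptions chosen for illustration, and the sketch uses only the Python standard library rather than any particular deep learning framework.

```python
import math

# The 17 label names come from the summary; their index order is an assumption.
LABELS = [
    "mouth", "esophagus", "stomach", "small intestine", "colon",
    "z-line", "pylorus", "ileocecal valve",
    "active bleeding", "angiectasia", "blood", "erosion", "erythema",
    "hematin", "lymphangioectasis", "polyp", "ulcer",
]


def encode(present):
    """Turn a collection of label names into a 17-element multi-hot vector."""
    unknown = set(present) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1.0 if name in present else 0.0 for name in LABELS]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode(logits, threshold=0.5):
    """Score each label independently and keep those at or above threshold."""
    return [name for name, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


def bce_loss(logits, target):
    """Mean binary cross-entropy over labels, a common multi-label loss."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, target):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(target)
```

Unlike a softmax classifier, each label gets its own sigmoid, so a single frame can carry both an anatomical label such as "small intestine" and a finding such as "blood" at the same time.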

Abstract

This work addresses the Gastro Competition task of multi-label classification from capsule endoscopic videos (CEV). A Transformer-based deep learning network is fine-tuned for this task. The base model is Google's publicly available Vision Transformer (ViT), patch16 at 224 x 224 resolution. In total, 17 labels are classified: mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve, active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, and ulcer. On the test dataset of three videos, the overall mAP@0.5 is 0.0205 and the overall mAP@0.95 is 0.0196.
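As a rough illustration of the metric quoted above, the following sketch computes a non-interpolated average precision (AP) per label from ranked scores and then averages across labels to get a mean average precision. It is a simplification under stated assumptions: the competition's official scoring procedure, and exactly how the 0.5 and 0.95 thresholds enter it, are not described here, so the sketch does not reproduce them. The function names and the input layout (`per_label` mapping a label to scores and binary ground truth) are hypothetical.

```python
def average_precision(scores, truths):
    """Non-interpolated AP: mean of the precision at each positive's rank."""
    # Rank samples from highest to lowest score.
    ranked = sorted(zip(scores, truths), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    # A label with no positive samples contributes 0.0 in this sketch.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_label):
    """Average AP over labels; per_label maps label -> (scores, truths)."""
    if not per_label:
        raise ValueError("per_label must contain at least one label")
    aps = {
        label: average_precision(scores, truths)
        for label, (scores, truths) in per_label.items()
    }
    return sum(aps.values()) / len(aps), aps


# Toy example with two labels and made-up scores (not data from the paper).
toy = {
    "blood": ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]),
    "polyp": ([0.2, 0.9, 0.4], [0, 0, 1]),
}
toy_map, toy_aps = mean_average_precision(toy)
```

Because the mean is taken over labels rather than over samples, rare findings weigh as much as common anatomical regions, which is one reason a value near 0.02 points to weak performance across most of the 17 classes.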
