VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection
arXiv cs.CV / 3/20/2026
Key Points
- The paper reframes capsule endoscopy event detection as a metric-aligned Rare-VISION task, focusing on event-level evaluation rather than frame-level accuracy.
- It fuses two backbones, EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for strong frame-level visual semantics, with a Diverse Head Ensemble and Validation-Guided Hierarchical Fusion.
- The decoding stage applies anatomy-aware temporal decoding, smoothing, threshold refinement, and per-label event generation to yield stable event predictions.
- Ablation studies show that combining the complementary backbones with validation-guided fusion and anatomy-aware decoding improves event-level performance. On a hidden test set, the method achieves a temporal mAP@0.5 of 0.3530 and an mAP@0.95 of 0.3235.
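
To make the fuse, decode, and evaluate flow in the points above more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the weighted-average fusion, the moving-average smoother, the fixed threshold, and the `min_len` filter are all assumptions chosen for illustration, and the anatomy-aware step is omitted entirely. It assumes each backbone yields one score per frame for a given label.

```python
from typing import List, Sequence, Tuple

Event = Tuple[int, int]  # inclusive (start_frame, end_frame)


def fuse_scores(per_model: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Weighted average of per-frame scores from several models.

    Assumption: the weights would be picked on a validation set; the
    paper's hierarchical fusion scheme is not reproduced here.
    """
    total = sum(weights)
    n_frames = len(per_model[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, per_model)) / total
        for i in range(n_frames)
    ]


def moving_average(scores: Sequence[float], window: int = 5) -> List[float]:
    """Centered moving average; the window shrinks at the sequence edges."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def scores_to_events(scores: Sequence[float], threshold: float = 0.5,
                     window: int = 5, min_len: int = 2) -> List[Event]:
    """Smooth, threshold, and group consecutive positive frames into events.

    Runs shorter than min_len frames are dropped as likely noise.
    """
    smoothed = moving_average(scores, window)
    events: List[Event] = []
    start = None
    for i, s in enumerate(smoothed):
        if s >= threshold and start is None:
            start = i  # a new run of positive frames begins
        elif s < threshold and start is not None:
            if i - start >= min_len:
                events.append((start, i - 1))
            start = None
    if start is not None and len(smoothed) - start >= min_len:
        events.append((start, len(smoothed) - 1))  # run reaches the last frame
    return events


def temporal_iou(a: Event, b: Event) -> float:
    """Intersection over union of two inclusive frame intervals."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union
```

In temporal detection benchmarks, the number after "mAP@" usually denotes the temporal IoU a predicted event must reach to count as a match. If that convention applies here, mAP@0.95 rewards near-exact event boundaries, which is one reason stable, smoothed event generation matters more than raw frame-level accuracy.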