VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection

arXiv cs.CV / 3/20/2026

Key Points

The paper formulates the RARE-VISION capsule endoscopy task as a metric-aligned event detection problem, targeting event-level evaluation rather than frame-level accuracy alone.
It combines two complementary backbones, EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for strong frame-level visual semantics, followed by a Diverse Head Ensemble and Validation-Guided Hierarchical Fusion.
The decoding stage applies temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation to yield stable event predictions.
Validation ablations indicate that complementary backbones, validation-guided fusion, and anatomy-aware temporal decoding all contribute to event-level performance. On the official hidden test set, the method achieves an overall temporal mAP@0.5 of 0.3530 and temporal mAP@0.95 of 0.3235.
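
To make the idea of class-wise model weighting concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary does not say how the weights are derived, so the sketch assumes each model's weight for a class is its validation score on that class, normalized across models. The names `classwise_weights`, `fuse`, `endofm_lv`, and `dinov3`, and all numbers, are hypothetical.

```python
from typing import Dict, List


def classwise_weights(val_scores: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Turn per-class validation scores into per-class model weights.

    val_scores[model][c] is an assumed validation score (for example a
    per-class average precision) of `model` on class c. For each class,
    the weights across models sum to 1.
    """
    models = list(val_scores)
    n_classes = len(val_scores[models[0]])
    weights = {m: [0.0] * n_classes for m in models}
    for c in range(n_classes):
        total = sum(val_scores[m][c] for m in models)
        for m in models:
            # Fall back to uniform weights when no model scores on this class.
            weights[m][c] = val_scores[m][c] / total if total > 0 else 1.0 / len(models)
    return weights


def fuse(probs: Dict[str, List[List[float]]],
         weights: Dict[str, List[float]]) -> List[List[float]]:
    """Weighted average of frame-level probabilities.

    probs[model][t][c] is the probability of class c at frame t.
    Returns fused[t][c].
    """
    models = list(probs)
    n_frames = len(probs[models[0]])
    n_classes = len(probs[models[0]][0])
    return [
        [sum(weights[m][c] * probs[m][t][c] for m in models) for c in range(n_classes)]
        for t in range(n_frames)
    ]


# Hypothetical validation scores for two classes.
val_scores = {"endofm_lv": [0.6, 0.2], "dinov3": [0.2, 0.6]}
weights = classwise_weights(val_scores)

# One frame, two classes: each class leans on the model that validated better.
probs = {"endofm_lv": [[0.8, 0.4]], "dinov3": [[0.4, 0.8]]}
fused = fuse(probs, weights)
```

In this toy setup each class is dominated by whichever model validated better on it, which is the intuition behind weighting per class instead of applying one global weight per model. Backbone-level weighting and probability calibration, which the paper also mentions, are left out of the sketch.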

Abstract

Capsule endoscopy event detection is challenging because diagnostically relevant findings are sparse, visually heterogeneous, and embedded in long, noisy video streams, while evaluation is performed at the event level rather than by frame accuracy alone. We therefore formulate the RARE-VISION task as a metric-aligned event detection problem instead of a purely frame-wise classification task. Our framework combines two complementary backbones, EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for strong frame-level visual semantics, followed by a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding. The fusion stage uses validation-derived class-wise model weighting, backbone weighting, and probability calibration, while the decoding stage applies temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation to produce stable event predictions. Validation ablations indicate that complementary backbones, validation-guided fusion, and anatomy-aware temporal decoding all contribute to event-level performance. On the official hidden test set, the proposed method achieved an overall temporal mAP@0.5 of 0.3530 and temporal mAP@0.95 of 0.3235.
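
The step from frame-level probabilities to events can be illustrated with a short Python sketch. It is a simplified stand-in for the paper's decoder, not a reproduction: it covers only moving-average smoothing, a fixed threshold, and per-label run extraction, and it omits the anatomical constraints and threshold refinement. The function names, window size, threshold, and minimum length are all assumptions. The `temporal_iou` helper assumes that the "@0.5" and "@0.95" in the reported metrics refer to temporal IoU thresholds between predicted and reference events, which is the usual convention.

```python
from typing import List, Tuple


def smooth(probs: List[float], window: int = 3) -> List[float]:
    """Centered moving average; the window is truncated at the sequence edges."""
    half = window // 2
    out = []
    for t in range(len(probs)):
        lo, hi = max(0, t - half), min(len(probs), t + half + 1)
        out.append(sum(probs[lo:hi]) / (hi - lo))
    return out


def decode_events(probs: List[float], threshold: float = 0.5,
                  window: int = 3, min_len: int = 2) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame indices of events for one label."""
    active = [p >= threshold for p in smooth(probs, window)]
    events, start = [], None
    for t, on in enumerate(active + [False]):  # sentinel closes a trailing run
        if on and start is None:
            start = t
        elif not on and start is not None:
            if t - start >= min_len:  # drop runs shorter than min_len frames
                events.append((start, t - 1))
            start = None
    return events


def temporal_iou(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """Intersection over union of two inclusive frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


# Hypothetical per-frame probabilities for one label: an isolated spike at
# frame 2, then a sustained run over frames 4 to 6.
frame_probs = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
events = decode_events(frame_probs)
overlap = temporal_iou(events[0], (4, 6))  # (4, 6) is a made-up reference event
```

With these toy numbers, smoothing merges the isolated spike's neighborhood into a single event spanning frames 3 to 6. Against the made-up reference event (4, 6) this gives a temporal IoU of 0.75, which would count as a match at a 0.5 threshold but not at 0.95. That gap shows why boundary quality matters for the stricter metric.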
