Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos
arXiv cs.CV / 4/24/2026
Key Points
- The paper introduces a new diagnosis-driven capsule endoscopy (CE) video summarization task that goes beyond frame-level detection to identify key evidence frames and produce accurate diagnoses.
- It addresses the core challenge that clinically relevant findings are extremely sparse and easily drowned out by many normal frames, while individual observations can be ambiguous because of blur, debris, and viewpoint changes.
- The authors release VideoCAP, a new CE dataset of 240 full-length videos whose diagnosis-driven annotations are derived from real clinical reports and supervise both evidence-frame extraction and diagnosis.
- They propose DiCE, a clinician-inspired framework that screens candidates, weaves them into coherent diagnostic contexts, and then aggregates multi-frame evidence to output robust clip-level diagnostic summaries.
- Experiments indicate DiCE achieves consistent improvements over state-of-the-art methods, supporting diagnosis-driven contextual reasoning as a promising approach for ultra-long CE video understanding.
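
As a rough illustration of the screen, weave, and aggregate flow described for DiCE in the key points, here is a minimal Python sketch. It is not the authors' implementation: the per-frame abnormality scores, the `threshold`, the `max_gap` grouping rule, the mean-score aggregation, and every function and class name are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Clip:
    """A hypothetical clip-level summary built from nearby candidate frames."""
    start: int         # index of the first candidate frame in the clip
    end: int           # index of the last candidate frame in the clip
    frames: List[int]  # candidate frame indices grouped into this clip
    score: float       # aggregated evidence score (mean of frame scores)


def screen(scores: Sequence[float], threshold: float) -> List[int]:
    """Screening step (assumed): keep frames whose score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def weave(candidates: Sequence[int], max_gap: int) -> List[List[int]]:
    """Weaving step (assumed): group candidates at most max_gap frames apart."""
    groups: List[List[int]] = []
    for idx in candidates:
        if groups and idx - groups[-1][-1] <= max_gap:
            groups[-1].append(idx)   # close enough: same diagnostic context
        else:
            groups.append([idx])     # too far: start a new context
    return groups


def aggregate(groups: Sequence[Sequence[int]], scores: Sequence[float]) -> List[Clip]:
    """Aggregation step (assumed): average the frame scores within each group."""
    clips: List[Clip] = []
    for group in groups:
        mean_score = sum(scores[i] for i in group) / len(group)
        clips.append(Clip(group[0], group[-1], list(group), mean_score))
    return clips


def summarize(scores: Sequence[float], threshold: float = 0.5, max_gap: int = 2) -> List[Clip]:
    """Run the three assumed steps in order and return clip-level summaries."""
    return aggregate(weave(screen(scores, threshold), max_gap), scores)


if __name__ == "__main__":
    # Toy per-frame scores; a real system would obtain these from a trained model.
    toy_scores = [0.1, 0.9, 0.2, 0.8, 0.1, 0.1, 0.1, 0.7, 0.1]
    for clip in summarize(toy_scores):
        print(clip)
```

On the toy scores above, frames 1 and 3 fall into one clip and frame 7 forms a second clip on its own. The point of the sketch is only that sparse positives are grouped into contexts before any clip-level decision is made, rather than each frame being judged in isolation.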