Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

arXiv cs.CV / 4/24/2026



Key Points

The paper introduces a new diagnosis-driven capsule endoscopy (CE) video summarization task that goes beyond frame-level detection to identify key evidence frames and produce accurate diagnoses.
It addresses the core challenge that clinically relevant findings are extremely sparse and easily drowned out by many normal frames, while observations can be ambiguous due to artifacts like blur, debris, and viewpoint changes.
The authors release VideoCAP, a new CE dataset with diagnosis-driven annotations created from real clinical reports, containing 240 full-length videos for both evidence-frame extraction and diagnosis supervision.
They propose DiCE, a clinician-inspired framework that screens candidates, weaves them into coherent diagnostic contexts, and then aggregates multi-frame evidence to output robust clip-level diagnostic summaries.
Experiments indicate DiCE achieves consistent improvements over state-of-the-art methods, supporting diagnosis-driven contextual reasoning as a promising approach for ultra-long CE video understanding.

Abstract

Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP comprises 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address this task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and finally applies an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods, producing concise and clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.
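
The three stages named in the abstract (candidate screening, context weaving, evidence convergence) can be pictured with a toy sketch. This is not the authors' implementation: the abstract does not say how the Context Weaver or Evidence Converger work internally, so everything below is an assumption made for illustration. The sketch assumes a per-frame abnormality score already exists, groups candidates by a temporal gap (`max_gap`), and aggregates each group by its mean score. The function names, thresholds, and the `Context` structure are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Context:
    """A run of temporally adjacent candidate frames (hypothetical structure)."""
    frame_indices: List[int]
    scores: List[float]


def screen_candidates(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Stage 1 (assumed): keep indices of frames whose score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def weave_contexts(candidates: Sequence[int], scores: Sequence[float],
                   max_gap: int = 5) -> List[Context]:
    """Stage 2 (assumed): merge candidates at most max_gap frames apart into one context."""
    contexts: List[Context] = []
    for idx in candidates:
        if contexts and idx - contexts[-1].frame_indices[-1] <= max_gap:
            # Close enough to the previous candidate: same event.
            contexts[-1].frame_indices.append(idx)
            contexts[-1].scores.append(scores[idx])
        else:
            # Too far away: start a new context so distinct events stay separate.
            contexts.append(Context([idx], [scores[idx]]))
    return contexts


def converge_evidence(context: Context, min_mean: float = 0.6) -> Dict[str, object]:
    """Stage 3 (assumed): pool multi-frame evidence into one clip-level judgment."""
    mean_score = sum(context.scores) / len(context.scores)
    best = max(range(len(context.scores)), key=context.scores.__getitem__)
    return {
        "key_frame": context.frame_indices[best],  # highest-scoring frame in the clip
        "mean_score": mean_score,
        "abnormal": mean_score >= min_mean,
    }


def summarize(scores: Sequence[float], threshold: float = 0.5,
              max_gap: int = 5, min_mean: float = 0.6) -> List[Dict[str, object]]:
    """Run the three toy stages end to end and return one summary per context."""
    candidates = screen_candidates(scores, threshold)
    contexts = weave_contexts(candidates, scores, max_gap)
    return [converge_evidence(c, min_mean) for c in contexts]
```

Calling `summarize` on a list of per-frame scores returns one dictionary per grouped event. The point of the grouping step in this sketch is that a single weak, isolated frame is judged on its own rather than being diluted by, or diluting, a stronger nearby event.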
