CTSCAN: Evaluation Leakage in Chest CT Segmentation and a Reproducible Patient-Disjoint Benchmark

arXiv cs.CV / 4/20/2026


Key Points

  • The paper argues that reported chest CT segmentation results are often inflated because train and test splits accidentally share slices from the same patient study.
  • It introduces CTSCAN, a reproducible multi-source benchmark and research stack that specifically evaluates models under patient-disjoint (case-disjoint) conditions.
  • Using the same FPN + EfficientNet-B0 baseline across a multi-seed sweep (3 seeds, 12 epochs per seed), the study shows large performance drops when switching from slice-mixed to case-disjoint evaluation (foreground Dice: 0.6665 → 0.2066; foreground IoU: 0.5031 → 0.1181).
  • The authors quantify the impact of eliminating patient reuse as a 0.4599 absolute (69.00% relative) decrease in foreground Dice and a 0.3850 absolute (76.52% relative) decrease in foreground IoU.
  • CTSCAN includes deterministic split manifests, weak-supervision controls, scripted multi-seed protocol sweeps, and reproducible figure generation to support fair future comparisons.
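
The difference between slice-mixed and case-disjoint splitting can be illustrated with a short sketch. The code below is not CTSCAN's implementation. It is a minimal example that assumes each slice carries a known case (patient study) identifier, and the names `case_disjoint_split`, `shared_cases`, and the slice/case ID format are hypothetical. It assigns whole cases to partitions using a seeded shuffle, then checks that no case appears in more than one partition.

```python
import json
import random
from collections import defaultdict
from itertools import combinations


def case_disjoint_split(slice_to_case, n_val_cases, n_test_cases, seed=0):
    """Assign whole cases to train/val/test so no case spans two partitions.

    slice_to_case maps a slice ID to its case ID (hypothetical ID scheme).
    Returns a manifest: partition name -> sorted list of slice IDs.
    """
    by_case = defaultdict(list)
    for slice_id, case_id in slice_to_case.items():
        by_case[case_id].append(slice_id)

    # Sort before shuffling so the result depends only on the seed,
    # not on the insertion order of the input mapping.
    case_ids = sorted(by_case)
    if n_val_cases + n_test_cases >= len(case_ids):
        raise ValueError("not enough cases left for training")
    random.Random(seed).shuffle(case_ids)

    groups = {
        "test": case_ids[:n_test_cases],
        "val": case_ids[n_test_cases:n_test_cases + n_val_cases],
        "train": case_ids[n_test_cases + n_val_cases:],
    }
    return {
        name: sorted(s for case in cases for s in by_case[case])
        for name, cases in groups.items()
    }


def shared_cases(manifest, slice_to_case):
    """Return {(partition_a, partition_b): cases present in both}.

    An empty dict means the manifest is case-disjoint.
    """
    cases = {
        name: {slice_to_case[s] for s in slices}
        for name, slices in manifest.items()
    }
    overlaps = {}
    for a, b in combinations(sorted(cases), 2):
        common = cases[a] & cases[b]
        if common:
            overlaps[(a, b)] = common
    return overlaps


def manifest_to_json(manifest):
    """Serialize a manifest with a stable key order so it can be diffed."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

Running `shared_cases` on a split built by shuffling individual slices instead of cases would typically report most cases as shared across partitions, which is the leakage pattern the paper describes.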

Abstract

Reported chest CT segmentation performance can be strongly inflated when train and test partitions mix slices from the same study. We present CTSCAN, a reproducible multi-source chest CT benchmark and research stack designed to measure what survives under patient-disjoint evaluation. The current four-class artifact aggregates 89 cases from PleThora, MedSeg SIRM, and LongCIU, and we show that the original slice-PNG workflow induces near-complete case reuse across train, validation, and test. Using the playground environment, we run a multi-seed protocol sweep with the same FPN plus EfficientNet-B0 control configuration under slice-mixed and case-disjoint evaluation. Across 3 seeds and 12 epochs per seed, the slice-mixed protocol reaches 0.6665 foreground Dice and 0.5031 foreground IoU, whereas the case-disjoint protocol reaches 0.2066 Dice and 0.1181 IoU. Removing patient reuse therefore reduces foreground Dice by 0.4599 absolute (69.00% relative) and foreground IoU by 0.3850 absolute (76.52% relative). CTSCAN packages the corrected benchmark with deterministic split manifests, explicit weak-supervision controls, a scripted multi-seed protocol sweep, and reproducible figure generation, providing a reusable basis for patient-disjoint chest CT evaluation.
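
The reported drops follow from simple arithmetic on the headline metrics. The sketch below shows one common way to compute foreground Dice and IoU and the relative decrease between two protocols. The abstract does not define its averaging convention, so treating "foreground" as the mean over labels 1 to 3 (with 0 as background) and skipping labels absent from both masks are assumptions, and the function names are hypothetical.

```python
def dice_and_iou(pred, target, label):
    """Dice and IoU for one class label over two flat label sequences.

    Returns None when the label is absent from both sequences, since
    both metrics are undefined (0/0) in that case.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    p = [x == label for x in pred]
    t = [x == label for x in target]
    intersection = sum(a and b for a, b in zip(p, t))
    p_sum, t_sum = sum(p), sum(t)
    union = p_sum + t_sum - intersection
    if union == 0:
        return None
    return 2 * intersection / (p_sum + t_sum), intersection / union


def foreground_scores(pred, target, foreground_labels=(1, 2, 3)):
    """Mean Dice and IoU over foreground labels (assumed convention)."""
    scores = [
        s
        for s in (dice_and_iou(pred, target, lab) for lab in foreground_labels)
        if s is not None
    ]
    if not scores:
        raise ValueError("no foreground label present in pred or target")
    mean_dice = sum(d for d, _ in scores) / len(scores)
    mean_iou = sum(i for _, i in scores) / len(scores)
    return mean_dice, mean_iou


def relative_drop(before, after):
    """Fraction of the original score lost when moving from before to after."""
    return (before - after) / before


# Figures quoted in the abstract (slice-mixed vs. case-disjoint).
dice_drop = relative_drop(0.6665, 0.2066)
iou_drop = relative_drop(0.5031, 0.1181)
```

With the rounded figures from the abstract, `dice_drop` comes to roughly 0.690 and `iou_drop` to roughly 0.765, in line with the reported 69.00% and 76.52% to within the precision of the rounded inputs.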