TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation

arXiv cs.LG / 4/20/2026

Key Points

  • TwinTrack is introduced as a post-hoc multi-rater calibration framework for medical image segmentation, targeting the ambiguity caused by inter-expert disagreement.
  • It calibrates ensemble segmentation probabilities to the empirical mean human response (MHR), defined as the fraction of expert annotators labeling each voxel as tumor.
  • The resulting calibrated probabilities are directly interpretable as the expected proportion of annotators who would assign the tumor label, explicitly reflecting uncertainty.
  • The calibration method is described as simple and requiring only a small multi-rater calibration dataset.
  • Evaluations on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark show consistent improvements in calibration metrics versus standard approaches.
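
As an illustration of the MHR definition in the key points, the sketch below computes the per-voxel fraction of annotators who marked tumor. The function name `mean_human_response`, the array layout (raters stacked along the first axis), and the toy masks are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def mean_human_response(rater_masks):
    """Per-voxel fraction of raters labeling tumor.

    rater_masks: binary array of shape (R, ...), one mask per annotator
    (hypothetical layout chosen for this sketch).
    """
    masks = np.asarray(rater_masks, dtype=np.float64)
    if masks.ndim < 2:
        raise ValueError("expected shape (num_raters, ...)")
    # Averaging binary labels over the rater axis gives the fraction
    # of annotators who assigned the tumor label to each voxel.
    return masks.mean(axis=0)

# Toy data: three hypothetical raters annotating a 2x2 slice.
masks = np.array([
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [1, 0]],
])
mhr = mean_human_response(masks)
```

In this toy case the top-left voxel gets an MHR of 1 (all raters agree), while the other voxels take values of 2/3, 1/3, and 0, which is the kind of soft target the calibrated probabilities are meant to match.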

Abstract

Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), defined as the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.
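
The abstract does not say which calibration map TwinTrack uses, so the following is only a minimal sketch of the general idea, with histogram binning swapped in as a stand-in technique: ensemble probabilities on a small calibration set are grouped into bins, and each bin is mapped to the mean MHR observed in it. All names (`fit_binned_calibrator`, `apply_binned_calibrator`, `n_bins`) and the toy numbers are hypothetical.

```python
import numpy as np

def _bin_index(probs, edges):
    # Interior edges split [0, 1] into len(edges) - 1 bins.
    return np.digitize(probs, edges[1:-1])

def fit_binned_calibrator(probs, mhr, n_bins=10):
    """Fit a histogram-binning map from ensemble probability to mean MHR.

    Stand-in technique for illustration; not the paper's stated method.
    """
    probs = np.asarray(probs, dtype=np.float64).ravel()
    mhr = np.asarray(mhr, dtype=np.float64).ravel()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = _bin_index(probs, edges)
    sums = np.bincount(idx, weights=mhr, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Bins with no calibration voxels fall back to the bin center.
    values = np.where(counts > 0, sums / np.maximum(counts, 1), centers)
    return edges, values

def apply_binned_calibrator(probs, edges, values):
    """Map new ensemble probabilities to calibrated MHR estimates."""
    probs = np.asarray(probs, dtype=np.float64)
    return values[_bin_index(probs, edges)]

# Toy calibration set: overconfident ensemble outputs paired with the
# fraction of (hypothetical) raters who labeled each voxel as tumor.
probs_cal = np.array([0.05, 0.05, 0.95, 0.95, 0.95, 0.95])
mhr_cal = np.array([0.0, 0.0, 1.0, 2 / 3, 2 / 3, 1 / 3])

edges, values = fit_binned_calibrator(probs_cal, mhr_cal)
calibrated = apply_binned_calibrator(np.array([0.05, 0.95, 0.55]), edges, values)
```

Here a raw ensemble output of 0.95 is mapped to about 0.67, read as "roughly two in three annotators would label this voxel as tumor". A monotone alternative such as isotonic regression or a parametric scaling could be substituted in the same two-step fit/apply structure.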