Rethinking Uncertainty in Segmentation: From Estimation to Decision

arXiv cs.AI · April 16, 2026


Key Points

  • The paper argues that medical segmentation pipelines typically estimate uncertainty but do not use it to drive downstream actions like accepting, flagging, or deferring predictions.
  • It reframes segmentation as a two-stage process—uncertainty estimation followed by decision-making—and shows that optimizing uncertainty metrics alone misses much of the potential safety improvement.
  • Experiments on retinal vessel segmentation benchmarks (DRIVE, STARE, CHASE_DB1) compare uncertainty sources including Monte Carlo Dropout and Test-Time Augmentation, paired with multiple deferral strategies.
  • The authors propose a confidence-aware deferral rule and report that the best method-policy combination can remove up to 80% of segmentation errors while deferring only about 25% of pixels, with strong cross-dataset robustness.
  • A key finding is that improvements in calibration do not necessarily improve decision quality, indicating a disconnect between common uncertainty measures and real-world utility.
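The deferral idea in the points above can be made concrete with a small sketch. This is an illustrative toy, not the paper's exact rule: here "uncertainty" is simply how close a pixel's predicted probability sits to 0.5, the most uncertain pixels are deferred, and we measure what fraction of segmentation errors the deferral removes.

```python
import numpy as np

def deferral_error_removal(probs, labels, defer_frac=0.25):
    """Defer the most uncertain pixels and report the fraction of
    segmentation errors that deferral removes.

    probs  : per-pixel foreground probabilities, shape (H, W)
    labels : binary ground truth, shape (H, W)
    """
    preds = (probs >= 0.5).astype(int)
    errors = preds != labels

    # Confidence-aware score: pixels near p = 0.5 are least confident.
    uncertainty = 1.0 - 2.0 * np.abs(probs - 0.5)

    # Defer the defer_frac most uncertain pixels.
    n_defer = int(defer_frac * probs.size)
    order = np.argsort(uncertainty.ravel())[::-1]
    deferred = np.zeros(probs.size, dtype=bool)
    deferred[order[:n_defer]] = True
    deferred = deferred.reshape(probs.shape)

    removed = errors & deferred
    return removed.sum() / max(errors.sum(), 1)

# Toy example: the two errors are exactly the two least confident
# pixels, so deferring half the pixels removes all errors.
probs = np.array([[0.9, 0.52], [0.1, 0.48]])
labels = np.array([[1, 0], [0, 1]])
print(deferral_error_removal(probs, labels, defer_frac=0.5))  # 1.0
```

In the paper's setting the ranking would come from an uncertainty map (MC Dropout or TTA) rather than raw probabilities, but the accept-or-defer mechanics are the same.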

Abstract

In medical image segmentation, uncertainty estimates are often reported but rarely used to guide decisions. We study the missing step: how uncertainty maps are converted into actionable policies such as accepting, flagging, or deferring predictions. We formulate segmentation as a two-stage pipeline, estimation followed by decision, and show that optimizing uncertainty alone fails to capture most of the achievable safety gains. Using retinal vessel segmentation benchmarks (DRIVE, STARE, CHASE_DB1), we evaluate two uncertainty sources (Monte Carlo Dropout and Test-Time Augmentation) combined with three deferral strategies, and introduce a simple confidence-aware deferral rule that prioritizes uncertain and low-confidence predictions. Our results show that the best method and policy combination removes up to 80 percent of segmentation errors at only 25 percent pixel deferral, while achieving strong cross-dataset robustness. We further show that calibration improvements do not translate to better decision quality, highlighting a disconnect between standard uncertainty metrics and real-world utility. These findings suggest that uncertainty should be evaluated based on the decisions it enables, rather than in isolation.
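One of the two uncertainty sources evaluated, Test-Time Augmentation, can be sketched in a few lines. This is a minimal illustration under assumed names (the `model` callable and the flip set are not the paper's API): the model is run on augmented copies of the image, each prediction is mapped back to the original frame, and per-pixel disagreement across the copies serves as the uncertainty map fed to a deferral policy.

```python
import numpy as np

def tta_uncertainty(model, image):
    """TTA sketch: predict on flipped copies, undo each flip, then use
    the per-pixel standard deviation across predictions as uncertainty.

    `model` maps an (H, W) image to per-pixel foreground probabilities.
    """
    augs = [
        (lambda x: x, lambda x: x),  # identity
        (np.fliplr, np.fliplr),      # horizontal flip (self-inverse)
        (np.flipud, np.flipud),      # vertical flip (self-inverse)
    ]
    preds = np.stack([undo(model(apply(image))) for apply, undo in augs])
    mean_prob = preds.mean(axis=0)    # averaged prediction
    uncertainty = preds.std(axis=0)   # disagreement across augmentations
    return mean_prob, uncertainty
```

A perfectly flip-equivariant model yields zero TTA uncertainty everywhere; the map is informative exactly where the model's output changes under augmentation, which is the disagreement signal the deferral policies act on.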