Right Regions, Wrong Labels: Semantic Label Flips in Segmentation under Correlation Shift

arXiv cs.CV · April 16, 2026


Key Points

  • The paper studies how semantic segmentation models can fail under correlation shift by producing “semantic label flips,” where pixels remain foreground and correctly bounded but receive the wrong class identity.
  • It introduces a diagnostic metric, “Flip,” to quantify how often ground-truth foreground pixels are assigned the wrong foreground label while still being predicted as foreground, enabling a finer-grained error breakdown than overlap alone.
  • Experiments show that stronger correlations between non-causal cues (e.g., category and scene) during training enlarge performance gaps between common and rare counterfactual test conditions and increase within-object label swaps.
  • The authors propose an entropy-based, ground-truth label-free “flip-risk” score to detect flip-prone cases at inference time, and provide accompanying code on GitHub.
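The Flip diagnostic described above could be sketched as follows. This is a minimal illustration, not the authors' implementation (see the linked GitHub repository for that); the function name, array layout, and the convention that class 0 is background are assumptions.

```python
import numpy as np

def flip_rate(pred, gt, background=0):
    """Sketch of a 'Flip'-style diagnostic: among ground-truth
    foreground pixels that the model also predicts as foreground,
    the fraction assigned the wrong foreground class identity.

    pred, gt: integer label maps of shape (H, W); `background`
    is the label treated as background (assumed to be 0 here).
    """
    gt_fg = gt != background          # ground-truth foreground mask
    pred_fg = pred != background      # predicted foreground mask
    both_fg = gt_fg & pred_fg         # pixels foreground in both
    if both_fg.sum() == 0:
        return 0.0
    flipped = both_fg & (pred != gt)  # foreground kept, identity wrong
    return float(flipped.sum() / both_fg.sum())
```

A pixel missed entirely (predicted background on ground-truth foreground) does not count as a flip under this definition, which is what separates flipped-identity errors from missed-to-background errors.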

Abstract

The robustness of machine learning models can be compromised by spurious correlations between non-causal features in the input data and target labels. A common way to test for such correlations is to train on data where the label is strongly tied to some non-causal cue, then evaluate on examples where that tie no longer holds. This idea is well established for classification tasks, but for semantic segmentation the specific failure modes are not well understood. We show that a model may achieve reasonable overlap while assigning the wrong semantic label, swapping one plausible foreground class for another, even when object boundaries are largely correct. We focus on this semantic label-flip behaviour and quantify it with a simple diagnostic (Flip) that counts how often ground-truth foreground pixels are assigned the wrong foreground identity while remaining predicted as foreground. In a setting where category and scene are correlated during training, increasing the correlation consistently widens the gap between common and rare test conditions and increases these within-object label swaps on counterfactual groups. Overall, our results motivate assessing segmentation robustness under distribution shift beyond overlap by decomposing foreground errors into correct pixels, flipped-identity pixels, and missed-to-background pixels. We also propose an entropy-based, ground-truth-label-free 'flip-risk' score, which is computed from foreground identity uncertainty, and show that it can flag flip-prone cases at inference time. Code is available at https://github.com/acharaakshit/label-flips.
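Since the abstract only says the flip-risk score is entropy-based and computed from foreground identity uncertainty, one plausible reading can be sketched as follows: renormalize the softmax probabilities over the foreground classes and measure per-pixel entropy on predicted-foreground pixels. The renormalization scheme, aggregation, and names here are assumptions, not the paper's definition.

```python
import numpy as np

def flip_risk(probs, background=0, eps=1e-12):
    """Sketch of an entropy-based flip-risk score (assumed form).

    probs: softmax class probabilities of shape (C, H, W).
    Renormalizes over foreground classes, computes the entropy of
    the resulting foreground-identity distribution per pixel, and
    averages it over pixels predicted as foreground. High values
    mean the model is unsure *which* foreground class a pixel is,
    even if it is confident the pixel is foreground.
    """
    C = probs.shape[0]
    fg_idx = [c for c in range(C) if c != background]
    fg = probs[fg_idx]                                     # (C-1, H, W)
    fg_norm = fg / (fg.sum(axis=0, keepdims=True) + eps)   # identity posterior
    ent = -(fg_norm * np.log(fg_norm + eps)).sum(axis=0)   # per-pixel entropy
    pred_fg = probs.argmax(axis=0) != background           # predicted foreground
    if pred_fg.sum() == 0:
        return 0.0
    return float(ent[pred_fg].mean())
```

Under this reading, a pixel with probabilities split evenly between two foreground classes scores entropy ln 2 regardless of how much background mass remains, which is exactly the "right region, uncertain identity" situation the score is meant to flag.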