Correcting Performance Estimation Bias in Imbalanced Classification with Minority Subconcepts

arXiv cs.AI / 4/30/2026

💬 Opinion | Ideas & Deep Analysis | Models & Research

Key Points

  • The paper argues that class-level evaluation in imbalanced classification can hide large performance gaps across different subconcepts within the same class.
  • It notes that prior mitigation methods rely on true subconcept labels at test time, which are often unavailable in real settings.
  • To address this, the authors propose a utility-weighted evaluation that substitutes missing subconcept labels with posterior probabilities from a multiclass subconcept model.
  • They call the resulting soft, uncertainty-aware metric predicted-weighted balanced accuracy (pBA), which aims to produce more stable and interpretable assessments.
  • Experiments across tabular, medical-imaging, and text benchmarks show that unweighted metrics can be misleading under within-class heterogeneity, while pBA better reflects subgroup performance when the subconcept distributions are uneven but not extreme.
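
To make the first point concrete, here is a minimal sketch with made-up numbers (not taken from the paper) showing how a class-level recall can look healthy while one subconcept inside the class is served poorly. The subconcept names "A" and "B", the sample counts, and the helper `recall_by_group` are all hypothetical, chosen only for illustration.

```python
import numpy as np

def recall_by_group(correct, groups):
    """Fraction of correctly classified samples within each group."""
    return {str(g): float(correct[groups == g].mean()) for g in np.unique(groups)}

# Hypothetical minority class with two subconcepts of unequal size.
groups = np.array(["A"] * 90 + ["B"] * 10)

# Illustrative outcomes: 81 of 90 correct on A, only 2 of 10 correct on B.
correct = np.concatenate(
    [np.ones(81), np.zeros(9), np.ones(2), np.zeros(8)]
).astype(bool)

class_recall = float(correct.mean())             # pooled over the whole class
per_group = recall_by_group(correct, groups)     # one recall per subconcept
macro = float(np.mean(list(per_group.values()))) # each subconcept counts equally

print(f"class-level recall:      {class_recall:.2f}")
print(f"per-subconcept recall:   {per_group}")
print(f"subconcept-macro recall: {macro:.2f}")
```

The pooled recall is dominated by the larger subconcept, so the weak result on the smaller one barely moves it. Averaging per subconcept exposes the gap, but it needs the true subconcept labels, which is the limitation the paper targets.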

Abstract

Class-level evaluation can conceal substantial performance disparities across subconcepts within the same class, causing models that perform well on average to fail on specific subpopulations. Prior work has shown that common evaluation measures for imbalanced classification are biased toward larger minority subconcepts and that utility-based reweighting using true subconcept labels can mitigate this bias; however, such labels are rarely available at test time. We introduce a practical utility-weighted evaluation that replaces unavailable subconcept labels with predicted posterior probabilities from a multiclass subconcept model. Evaluation weights are defined as the expected utility under this posterior, yielding a soft, uncertainty-aware metric we call predicted-weighted balanced accuracy (pBA). Experiments on tabular benchmarks as well as medical-imaging and text datasets show that unweighted scores can be misleading under within-class heterogeneity, while pBA provides more stable and interpretable assessments when subconcept distributions are uneven but not pathological. Our code is available at: https://anonymous.4open.science/r/correcting-bias-imbalance-9C6C/.
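
The abstract does not spell out the formula, so the following is only a sketch of one plausible reading, not the authors' implementation. It assumes each sample's weight is the expected utility under its predicted subconcept posterior, that the utility of a subconcept is the inverse of its soft size (a hypothetical choice), and that the metric is the mean of weighted per-class recalls. The function name `predicted_weighted_balanced_accuracy` and all data below are illustrative.

```python
import numpy as np

def predicted_weighted_balanced_accuracy(y_true, y_pred, posterior, utility=None):
    """Mean over classes of utility-weighted recall (assumed form of pBA).

    posterior: array of shape (n_samples, n_subconcepts), rows sum to 1.
    utility:   array of shape (n_subconcepts,). If None, use the inverse
               soft subconcept size (an assumption, not from the paper).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    posterior = np.asarray(posterior, dtype=float)
    if utility is None:
        utility = 1.0 / np.clip(posterior.sum(axis=0), 1e-12, None)
    # Per-sample weight = expected utility under the subconcept posterior.
    weights = posterior @ np.asarray(utility, dtype=float)
    correct = (y_true == y_pred).astype(float)
    recalls = []
    for c in np.unique(y_true):
        mask = y_true == c
        recalls.append(np.sum(weights[mask] * correct[mask]) / np.sum(weights[mask]))
    return float(np.mean(recalls))

# Illustrative data: class 0 has one subconcept (200 samples, 190 correct);
# class 1 has a large subconcept (90 samples, 81 correct) and a small one
# (10 samples, 2 correct).
y_true = np.array([0] * 200 + [1] * 100)
y_pred = np.array([0] * 190 + [1] * 10 + [1] * 81 + [0] * 9 + [1] * 2 + [0] * 8)
subconcept = np.array([0] * 200 + [1] * 90 + [2] * 10)

# A perfectly confident subconcept model gives one-hot posteriors; a real
# model would output soft rows, which the same function accepts.
posterior = np.eye(3)[subconcept]

# Uniform utility gives every sample weight 1, i.e. plain balanced accuracy.
plain_ba = predicted_weighted_balanced_accuracy(y_true, y_pred, posterior, np.ones(3))
pba = predicted_weighted_balanced_accuracy(y_true, y_pred, posterior)

print(round(plain_ba, 3), round(pba, 3))  # 0.89 0.75
```

With one-hot posteriors and this utility, each class term reduces to the average of its per-subconcept recalls, so the small, poorly served subconcept pulls the score down from 0.89 to 0.75. With soft posteriors, each sample spreads its weight across subconcepts in proportion to the model's uncertainty.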