AI Navigate

Mitigating the Multiplicity Burden: The Role of Calibration in Reducing Predictive Multiplicity of Classifiers

arXiv cs.LG / March 13, 2026


Key Points

  • The paper analyzes how calibration affects predictive multiplicity in classifiers and whether post-hoc calibration can reduce algorithmic arbitrariness in high-stakes credit decisions.
  • Using nine diverse credit risk benchmark datasets, it shows predictive multiplicity tends to concentrate in low-confidence regions and disproportionately affects minority class observations.
  • Post-hoc calibration methods such as Platt Scaling, Isotonic Regression, and Temperature Scaling are associated with lower obscurity across the Rashomon set, with Platt Scaling and Isotonic Regression performing best.
  • The findings suggest calibration can act as a consensus-enforcing layer and support procedural fairness in credit scoring.
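To make the calibration step concrete: Platt Scaling (a sigmoid fit on held-out scores) and Isotonic Regression are both available in scikit-learn via `CalibratedClassifierCV`; Temperature Scaling is typically implemented by hand for neural networks and is not shown here. The sketch below is illustrative only, using a synthetic imbalanced dataset and a random forest base model rather than the paper's nine credit risk benchmarks or its experimental protocol.

```python
# Illustrative sketch of post-hoc calibration with Platt scaling ("sigmoid")
# and isotonic regression via scikit-learn. Dataset, model class, and the
# Brier score comparison are assumptions for the example, not the paper's setup.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

# Synthetic, class-imbalanced data standing in for a credit risk dataset.
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Uncalibrated baseline.
base = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
uncal = brier_score_loss(y_te, base.predict_proba(X_te)[:, 1])

# Platt scaling and isotonic regression, each fit with internal cross-validation.
results = {}
for name, method in [("platt", "sigmoid"), ("isotonic", "isotonic")]:
    calib = CalibratedClassifierCV(
        RandomForestClassifier(random_state=0), method=method, cv=3
    ).fit(X_tr, y_tr)
    results[name] = brier_score_loss(y_te, calib.predict_proba(X_te)[:, 1])
```

Lower Brier scores for the calibrated models would indicate better probabilistic reliability; whether calibration also lowers multiplicity across the Rashomon set is the paper's empirical question, not something this snippet demonstrates.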

Abstract

As machine learning models are increasingly deployed in high-stakes environments, ensuring both probabilistic reliability and prediction stability has become critical. This paper examines the interplay between classification calibration and predictive multiplicity - the phenomenon in which multiple near-optimal models within the Rashomon set yield conflicting credit outcomes for the same applicant. Using nine diverse credit risk benchmark datasets, we investigate whether predictive multiplicity concentrates in regions of low predictive confidence and how post-hoc calibration can mitigate algorithmic arbitrariness. Our empirical analysis reveals that minority class observations bear a disproportionate multiplicity burden, as confirmed by significant disparities in predictive multiplicity and prediction confidence. Furthermore, our empirical comparisons indicate that applying post-hoc calibration methods - specifically Platt Scaling, Isotonic Regression, and Temperature Scaling - is associated with lower obscurity across the Rashomon set. Among the tested techniques, Platt Scaling and Isotonic Regression provide the most robust reduction in predictive multiplicity. These findings suggest that calibration can function as a consensus-enforcing layer and may support procedural fairness by mitigating predictive multiplicity.
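The "ambiguity" flavor of predictive multiplicity the abstract refers to can be sketched with a small empirical Rashomon set: retrain the same model class under different random seeds, keep the models whose test accuracy is within a tolerance epsilon of the best, and measure the fraction of points on which those near-optimal models disagree (one common definition, following e.g. Marx et al., 2020). The model class, epsilon, and seed-based construction below are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch: ambiguity over an empirical Rashomon set, approximated
# by retraining with different seeds. Epsilon and the model class are
# assumptions for the example.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Ten near-identical training runs; subsampling injects run-to-run variation.
models = [
    GradientBoostingClassifier(random_state=s, subsample=0.7).fit(X_tr, y_tr)
    for s in range(10)
]
accs = np.array([m.score(X_te, y_te) for m in models])

# Keep only models within epsilon of the best accuracy (the Rashomon set).
eps = 0.01
rashomon = [m for m, a in zip(models, accs) if a >= accs.max() - eps]

# Ambiguity: fraction of test points where any two kept models disagree.
preds = np.stack([m.predict(X_te) for m in rashomon])  # (n_models, n_samples)
disagree = preds.min(axis=0) != preds.max(axis=0)
ambiguity = disagree.mean()

# The abstract's claim that multiplicity concentrates in low-confidence
# regions can be eyeballed by comparing confidence on contested points.
conf = rashomon[0].predict_proba(X_te).max(axis=1)
if disagree.any() and (~disagree).any():
    mean_conf_contested = conf[disagree].mean()
    mean_conf_uncontested = conf[~disagree].mean()
```

On data like this, the contested points typically sit closer to the decision boundary, i.e. `mean_conf_contested` tends to be lower, which is the low-confidence concentration pattern the paper reports for credit datasets.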