The Manokhin Probability Matrix: A Diagnostic Framework for Classifier Probability Quality

arXiv stat.ML / 5/6/2026


Key Points

  • The Manokhin Probability Matrix is a new diagnostic framework that splits classifier probability quality into two components—reliability (calibration) and resolution (discrimination)—addressing the limitation of the single-number Brier score.
  • Classifier performance is mapped onto a 2x2 grid using the Spiegelhalter Z-statistic (calibration) and expected AUC-ROC rank (discrimination), producing four actionable archetypes: Eagle, Bull, Sloth, and Mole (see the sketch after this list).
  • The study of 21 classifiers, 5 post-hoc calibrators, and 30 TabArena-v0.1 binary tasks assigns clear archetypes: CatBoost/TabICL/EBM/TabPFN/GBC/Random Forest as Eagles; XGBoost/LightGBM/HGB as Bulls; SVM/LR/LDA/base-rate predictor as Sloths; and MLP/KNN/Naive Bayes/ExtraTrees as Moles.
  • Results show that Venn-Abers calibration reduces log-loss for Bulls by 6.5%–12.6% but slightly degrades Eagles (−2.1%), and a theoretical result (Proposition 1) shows that order-preserving post-hoc calibration cannot increase discriminatory power.
  • The recommended practice is to decompose the Brier score before optimizing it: optimize for discrimination first, then apply post-hoc calibration to correct reliability. Code and experimental data are released on GitHub.
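
To make the grid placement concrete, here is a minimal sketch of the two axes in Python. Spiegelhalter's Z-statistic formula is standard; the `|Z| < 1.96` calibration cutoff, the 0.5 rank split, and the rank-rescaling convention are illustrative assumptions, not thresholds taken from the paper.

```python
# Sketch only: the thresholds below are assumptions, not the paper's exact rule.
import numpy as np

def spiegelhalter_z(y_true: np.ndarray, p_pred: np.ndarray) -> float:
    """Spiegelhalter's Z-statistic; ~N(0, 1) under perfect calibration,
    so a large |Z| signals miscalibration."""
    num = np.sum((y_true - p_pred) * (1.0 - 2.0 * p_pred))
    den = np.sqrt(np.sum((1.0 - 2.0 * p_pred) ** 2 * p_pred * (1.0 - p_pred)))
    return num / den

def archetype(y_true, p_pred, auc_rank, z_crit=1.96, rank_split=0.5):
    """Map the two axes onto one of the four archetypes.

    auc_rank: expected AUC-ROC rank across tasks, rescaled to [0, 1]
    with 1 = best (an assumed convention for this sketch).
    """
    calibrated = abs(spiegelhalter_z(y_true, p_pred)) < z_crit
    discriminative = auc_rank >= rank_split
    if calibrated and discriminative:
        return "Eagle"  # good on both axes
    if discriminative:
        return "Bull"   # strong discrimination, poor calibration
    if calibrated:
        return "Sloth"  # well-calibrated, weak discrimination
    return "Mole"       # poor on both axes
```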

Abstract

The Brier score conflates two distinct properties of probabilistic predictions: reliability (calibration error) and resolution (discriminatory power). We introduce the Manokhin Probability Matrix, a BCG-style two-dimensional diagnostic framework that separates them. Classifiers are placed on a 2x2 grid by the Spiegelhalter Z-statistic and expected AUC-ROC rank, then assigned to one of four archetypes: Eagle (good on both axes), Bull (strong discrimination, poor calibration), Sloth (well-calibrated, weak discriminator), and Mole (poor on both). Each archetype carries a distinct prescription. We populate the matrix from a large-scale empirical study spanning 21 classifiers, 5 post-hoc calibrators, and 30 real-world binary classification tasks from the TabArena-v0.1 suite. The assignment is unambiguous. CatBoost, TabICL, EBM, TabPFN, GBC, and Random Forest are Eagles. XGBoost, LightGBM, and HGB are Bulls; Venn-Abers calibration cuts log-loss by 6.5%–12.6% on Bulls but degrades Eagles by 2.1%. SVM, LR, LDA, and the empirical base-rate predictor are Sloths. MLP, KNN, Naive Bayes, and ExtraTrees are Moles. A theoretical asymmetry follows: no order-preserving post-hoc calibrator can add discriminatory power (Proposition 1), so calibration is the fixable part and discrimination is the hard part. The practical rule is direct: do not optimise the aggregate Brier score without first decomposing it; optimise discrimination first, then fix calibration post-hoc. Code and raw experimental data are available at https://github.com/valeman/classifier_calibration.
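
Proposition 1 is easy to check empirically: any order-preserving map of the scores leaves their ranking intact, so AUC-ROC cannot improve, while reliability can. The sketch below uses scikit-learn's isotonic regression as a stand-in for Venn-Abers (which is not part of scikit-learn) on synthetic data; the dataset and model choices are arbitrary illustrations, not the paper's setup.

```python
# Empirical check of Proposition 1: a monotone calibrator can repair
# reliability (lower log-loss) but cannot add discrimination (AUC is
# unchanged up to ties introduced by isotonic's flat segments).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
p_cal = model.predict_proba(X_cal)[:, 1]
p_test = model.predict_proba(X_test)[:, 1]

# Fit the monotone map on held-out calibration data, apply it to the test set.
iso = IsotonicRegression(out_of_bounds="clip").fit(p_cal, y_cal)
p_test_cal = iso.predict(p_test)
# Clip away the exact 0/1 values isotonic can produce before taking logs.
p_safe = np.clip(p_test_cal, 1e-6, 1 - 1e-6)

print(f"AUC      before/after: {roc_auc_score(y_test, p_test):.4f} / "
      f"{roc_auc_score(y_test, p_test_cal):.4f}")
print(f"log-loss before/after: {log_loss(y_test, p_test):.4f} / "
      f"{log_loss(y_test, p_safe):.4f}")
```

Because isotonic regression can only create ties, the post-calibration AUC is equal to or marginally below the original, never above it; any gains show up on the reliability axis alone, which is the asymmetry Proposition 1 formalizes.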