Beyond ECE: Calibrated Size Ratio, Risk Assessment, and Confidence-Weighted Metrics

arXiv cs.LG / 5/5/2026


Key Points

  • The article argues that the commonly used Expected Calibration Error (ECE) can fail by reporting low error even when a model carries arbitrarily large overconfidence risk (see the sketch after this list).
  • It proposes a new metric, Calibrated Size Ratio (CSR), which equals 1 under perfect calibration and is used to derive a risk probability (P_risk) that quantifies statistical evidence of overconfidence.
  • The authors contend that overconfidence risk assessment should be paired with a measure of discriminative value: how well confidence scores separate correct from incorrect predictions.
  • They introduce confidence-weighted accuracy (cwA) and show how confidence weighting extends to standard classification metrics, proving that confidence-weighted AUC (cwAUC) preserves calibration information that classical AUC misses.
  • Empirical validation on synthetic distributions and 15 real datasets (with and without post-hoc calibration) shows CSR provides near-perfect sensitivity and specificity across tested calibration profiles.

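To make the first point concrete, here is a small illustrative sketch (ours, not taken from the paper). It uses the standard equal-width binned definition ECE = Σ_m (|B_m|/n) · |acc(B_m) − conf(B_m)|, under which each bin contributes only in proportion to its size, so a tiny group of severely overconfident high-confidence predictions barely moves the score:

```python
import numpy as np

def ece(conf, correct, n_bins=10):
    """Standard ECE: bin-size-weighted average |accuracy - confidence|
    gap over equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

rng = np.random.default_rng(0)

# 9,900 predictions at confidence 0.6 that are right 60% of the time
# (perfectly calibrated), plus 100 predictions at confidence 0.99 that
# are right only 50% of the time (severely overconfident).
conf = np.concatenate([np.full(9900, 0.6), np.full(100, 0.99)])
correct = np.concatenate([rng.random(9900) < 0.6, rng.random(100) < 0.5])

print(f"ECE = {ece(conf, correct):.4f}")  # ~0.005: looks well calibrated
# Yet every high-stakes (0.99-confidence) prediction is a coin flip.
```

The high-confidence bucket is off by 0.49, but because it holds only 1% of the samples its contribution to ECE is about 0.005, which is exactly the failure mode the paper targets.
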
Abstract

Confidence calibration has been dominated by the Expected Calibration Error (ECE), a linear metric that weights calibration error equally regardless of the confidence level at which it occurs. We show that ECE can remain small even under arbitrarily large overconfidence risk. We therefore propose the Calibrated Size Ratio (CSR), an interpretable metric that equals 1 under perfect calibration, and from it derive a risk probability P_risk that quantifies the statistical evidence for overconfidence. We further argue that overconfidence risk assessment must be complemented by a measure of discriminative value: whether the assigned confidences actively distinguish correct from incorrect predictions. We show that confidence-weighted accuracy (cwA) is the natural such complement, and that confidence weighting extends to all standard classification metrics. In particular, we prove that the confidence-weighted AUC (cwAUC) captures calibration information that the classical AUC cannot. We validate the proposed indicators on several synthetic confidence distributions under multiple controlled calibration profiles and on fifteen real datasets with and without post-hoc calibration. Experiments demonstrate that CSR achieves near-perfect sensitivity and specificity across all tested conditions.
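
As a rough illustration of the discriminative-value idea, the sketch below computes plain accuracy, one plausible confidence-weighted accuracy (each prediction contributes in proportion to the confidence staked on it), and a sample-weighted AUC. The paper's exact definitions of cwA and cwAUC may differ, so treat the weighting choices here as our assumptions. The point being demonstrated is standard, though: classical AUC is rank-based and invariant to any monotone rescaling of the confidences, so it is blind to calibration, while a confidence-weighted variant responds to the confidence magnitudes themselves:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
conf = rng.uniform(0.5, 1.0, size=5000)            # predicted confidences
correct = (rng.random(5000) < conf).astype(float)  # well-calibrated toy model

# Plain accuracy vs. a confidence-weighted accuracy in which each
# prediction is weighted by the confidence staked on it (an assumed
# form of cwA, not necessarily the paper's definition).
acc = correct.mean()
cwa = (conf * correct).sum() / conf.sum()

# Classical AUC asks how well confidence separates correct from
# incorrect predictions. It is rank-based: a monotone rescaling of
# the confidences (here, cubing) leaves it exactly unchanged.
auc = roc_auc_score(correct, conf)
auc_cubed = roc_auc_score(correct, conf ** 3)      # identical to auc

# One assumed confidence-weighted construction: weight each sample by
# its own confidence. Unlike plain AUC, this shifts under the same
# monotone rescaling, so it retains confidence-magnitude information.
cw_auc = roc_auc_score(correct, conf, sample_weight=conf)
cw_auc_cubed = roc_auc_score(correct, conf ** 3, sample_weight=conf ** 3)

print(f"acc={acc:.3f}  cwA={cwa:.3f}")
print(f"AUC={auc:.3f}  AUC(conf^3)={auc_cubed:.3f}")            # equal
print(f"cwAUC={cw_auc:.3f}  cwAUC(conf^3)={cw_auc_cubed:.3f}")  # differ
```

Recalibrating a model typically applies a monotone map to its confidences, which is invisible to classical AUC; any weighted variant along these lines will register it, which is consistent with the paper's claim that cwAUC preserves calibration information AUC misses.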