Enforcing tail calibration when training probabilistic forecast models

arXiv stat.ML / 5/5/2026

Key Points

  • Probabilistic forecasting models can become miscalibrated when their model class is misspecified, leading to unreliable probability estimates for users’ decision-making.
  • The study proposes modifying the training loss, by using weighted proper scoring rules and by adding a regularization term based on tail miscalibration, to improve forecast reliability specifically for extreme events.
  • Experiments on UK wind-speed forecasts across increasingly flexible model families (simple parametric models, distributional regression networks, and conditional generative models) show that even state-of-the-art models do not issue calibrated forecasts for extreme wind speeds.
  • The authors find a trade-off: loss adaptations that improve calibration for extreme events can come at the expense of calibration for more common (less extreme) outcomes.
  • The work suggests a practical path to better probabilistic reliability by tailoring the objective function to penalize tail errors during training.
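
As a rough illustration of the weighted scoring rules mentioned in the points above, the sketch below computes a threshold-weighted CRPS from an ensemble forecast. It is not taken from the paper: the function names, the ensemble representation, and the indicator weight w(z) = 1{z >= threshold} are assumptions chosen for illustration, and the authors may use different weights or estimators.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Sample-based CRPS estimate for one forecast: E|X - y| - 0.5 * E|X - X'|."""
    members = np.asarray(members, dtype=float)
    term_obs = np.mean(np.abs(members - obs))
    # Mean absolute difference over all ordered pairs of members (including i == j).
    term_spread = np.mean(np.abs(members[:, None] - members[None, :]))
    return float(term_obs - 0.5 * term_spread)

def tw_crps_ensemble(members, obs, threshold):
    """Threshold-weighted CRPS with the (assumed) weight w(z) = 1{z >= threshold}.

    For this weight, the score equals the ordinary CRPS after mapping both the
    ensemble and the observation through v(z) = max(z, threshold), so only
    behaviour above the threshold contributes to the loss.
    """
    v_members = np.maximum(np.asarray(members, dtype=float), threshold)
    v_obs = max(float(obs), float(threshold))
    return crps_ensemble(v_members, v_obs)

if __name__ == "__main__":
    ens = [4.0, 6.5, 9.0, 12.0]   # hypothetical wind-speed members (m/s)
    print(crps_ensemble(ens, 11.0))
    print(tw_crps_ensemble(ens, 11.0, threshold=10.0))
```

With a very low threshold the weighted score reduces to the ordinary CRPS, and with a threshold above every member and the observation it is zero, which is one way to see why emphasizing the tail discards information about common outcomes.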

Abstract

Probabilistic forecasts are typically obtained using state-of-the-art statistical and machine learning models, with model parameters estimated by optimizing a proper scoring rule over a set of training data. If the model class is not correctly specified, then the learned model will not necessarily issue forecasts that are calibrated. Calibrated forecasts allow users to appropriately balance risks in decision making, and it is particularly important that forecast models issue calibrated predictions for extreme events, since such outcomes often generate large socio-economic impacts. In this work, we study how the loss function used to train probabilistic forecast models can be adapted to improve the reliability of forecasts made for extreme events. We investigate loss functions based on weighted scoring rules, and additionally propose regularizing loss functions using a measure of tail miscalibration. We apply these approaches to a hierarchy of increasingly flexible forecast models for UK wind speeds, including simple parametric models, distributional regression networks, and conditional generative models. We demonstrate that state-of-the-art models do not issue calibrated forecasts for extreme wind speeds, and that the calibration of forecasts for extreme events can be improved by suitable adaptations to the loss function during model training. This introduces a trade-off between calibrated forecasts for extreme events and calibrated forecasts for more common outcomes.
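
The abstract does not specify the tail miscalibration measure used for regularization, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes each forecast is a continuous, strictly increasing CDF F_i, takes as inputs the values F_i(y_i) and F_i(t) for a fixed threshold t, and penalizes the departure from uniformity of the conditional PIT values (F_i(y_i) - F_i(t)) / (1 - F_i(t)) among cases where the observation exceeds t. The names `tail_miscalibration`, `regularized_loss`, and `lam` are hypothetical.

```python
import numpy as np

def tail_miscalibration(pit_obs, pit_threshold, n_grid=99):
    """Hypothetical tail miscalibration penalty (not the paper's definition).

    pit_obs[i]       = F_i(y_i), forecast CDF evaluated at the observation
    pit_threshold[i] = F_i(t),   forecast CDF evaluated at the threshold t
    """
    pit_obs = np.asarray(pit_obs, dtype=float)
    pit_threshold = np.asarray(pit_threshold, dtype=float)
    # For strictly increasing F_i, y_i > t is equivalent to F_i(y_i) > F_i(t).
    exceed = pit_obs > pit_threshold
    if not np.any(exceed):
        return 0.0
    # Conditional PIT of the excess; uniform on [0, 1] if the tail is calibrated.
    z = (pit_obs[exceed] - pit_threshold[exceed]) / (1.0 - pit_threshold[exceed])
    u = np.linspace(0.01, 0.99, n_grid)
    ecdf = np.mean(z[:, None] <= u[None, :], axis=0)
    # Mean squared distance between the empirical CDF of z and the uniform CDF.
    return float(np.mean((ecdf - u) ** 2))

def regularized_loss(scores, pit_obs, pit_threshold, lam=1.0):
    """Average proper score plus lam times the tail penalty (lam is an assumed knob)."""
    return float(np.mean(scores)) + lam * tail_miscalibration(pit_obs, pit_threshold)
```

Larger values of `lam` would put more emphasis on the tail term relative to the average score, which mirrors the trade-off the abstract describes. This version uses hard indicators and is therefore not differentiable; gradient-based training would need a smooth surrogate, and a fuller treatment would presumably also check how often the threshold is exceeded relative to the forecast exceedance probabilities.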