A Tale of Two Variances: When Single-Seed Benchmarks Fail in Bayesian Deep Learning
arXiv cs.LG · April 28, 2026
Key Points
- In small-data Bayesian deep learning, reporting a single-seed evaluation metric value (e.g., CRPS) as if it were deterministic can be misleading because the metric’s endpoint is a random variable.
- Across 50 independent runs on six regression datasets, CRPS variance trajectories vary widely by method: MAP and Deep Ensembles can show reproducible variance peaks at intermediate training sizes, while MC Dropout and Bayes by Backprop typically exhibit smooth variance decay.
- These variance peaks materially affect reliability—for example, on the Seoul Bike dataset the relative RMSE of a single-seed MAP estimate reaches 93.6%, and the chance of landing within ±10% of the repeated-run mean falls to 5.9%.
- Local CRPS variance is a strong predictor of single-seed estimation error (Spearman correlation >0.96 across the real datasets), and switching the training objective to β-NLL substantially reduces the irregular variance behavior.
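The reliability statistics quoted above (relative RMSE of a single-seed estimate and the chance of landing within ±10% of the repeated-run mean) can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the function name `single_seed_reliability` and the synthetic metric values are assumptions for the example.

```python
import numpy as np

def single_seed_reliability(metric_values, tol=0.10):
    """Given one metric value (e.g., CRPS) per independent seed, report how
    reliable a single-seed estimate of that metric would be.

    Returns (relative RMSE w.r.t. the repeated-run mean,
             fraction of seeds landing within ±tol of that mean)."""
    m = np.asarray(metric_values, dtype=float)
    mean = m.mean()
    # Relative RMSE of a single-seed estimate against the repeated-run mean.
    rel_rmse = np.sqrt(np.mean((m - mean) ** 2)) / mean
    # Chance that one seed lands within ±tol (e.g., ±10%) of the mean.
    hit_rate = np.mean(np.abs(m - mean) <= tol * mean)
    return rel_rmse, hit_rate

# Synthetic stand-ins for 50 repeated runs of two hypothetical methods:
# a smooth variance-decay regime vs. a variance-peak regime.
rng = np.random.default_rng(0)
smooth = rng.normal(1.0, 0.02, size=50)  # e.g., MC-Dropout-like behavior
spiky = rng.normal(1.0, 0.60, size=50)   # e.g., a MAP-style variance peak

for name, vals in [("smooth", smooth), ("spiky", spiky)]:
    rel, hit = single_seed_reliability(vals)
    print(f"{name}: rel. RMSE {rel:.1%}, within ±10% of mean {hit:.1%}")
```

Under the spiky regime, the relative RMSE of a single seed is large and the ±10% hit rate collapses, mirroring the Seoul Bike numbers (93.6% relative RMSE, 5.9% hit rate) reported in the paper.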