Towards E-Value Based Stopping Rules for Bayesian Deep Ensembles

arXiv stat.ML / 4/21/2026


Key Points

  • The paper studies Bayesian Deep Ensembles (BDEs), aiming to reduce the high cost of long MCMC sampling used for uncertainty quantification in deep learning.
  • It proposes an E-value based stopping rule to determine when sequential MCMC sampling is no longer providing statistically significant gains over an already-optimized deep ensemble baseline.
  • The method is formalized as a sequential, anytime-valid hypothesis test whose null hypothesis is that MCMC offers no improvement over a strong baseline, which gives a principled basis for early stopping.
  • Experiments across diverse settings indicate that the approach is effective and that often only a fraction of the full-chain sampling budget is required.
  • The key practical takeaway is a theoretically grounded criterion for shortening sampling runs without sacrificing meaningful improvement.
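
For readers unfamiliar with the terminology in the points above, the standard definitions behind an anytime-valid test can be stated briefly. These are generic textbook facts about e-processes, not the paper's specific construction, and the symbols E_t, τ, and α are notation chosen here for illustration.

```latex
% Generic definitions; the notation (E_t, \tau, \alpha) is chosen here
% and is not taken from the paper.
% An e-process under the null H_0 is a nonnegative process with
\[
  E_t \ge 0 \ \text{for all } t,
  \qquad
  \mathbb{E}_{H_0}[E_\tau] \le 1 \ \text{for every stopping time } \tau .
\]
% Ville's inequality then bounds the chance of ever crossing 1/\alpha:
\[
  \Pr_{H_0}\!\left( \exists\, t \ge 1 : E_t \ge \tfrac{1}{\alpha} \right) \le \alpha .
\]
```

Because the bound holds uniformly over time, the evidence can be checked after every new sample and acted on at the first crossing without inflating the type-I error beyond α.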

Abstract

Bayesian Deep Ensembles (BDEs) represent a powerful approach for uncertainty quantification in deep learning, combining the robustness of Deep Ensembles (DEs) with flexible multi-chain MCMC. While DEs are affordable in most deep learning settings, (long) sampling of Bayesian neural networks can be prohibitively costly. Yet, adding sampling after optimizing the DEs has been shown to yield significant improvements. This leaves a critical practical question: How long should the sequential sampling process continue to yield significant improvements over the initial optimized DE baseline? To tackle this question, we propose a stopping rule based on E-values. We formulate the ensemble construction as a sequential anytime-valid hypothesis test, which provides a principled way to decide whether to reject the null hypothesis that MCMC offers no improvement over a strong baseline and, in turn, when to stop sampling early. Empirically, we study this approach in diverse settings. Our results demonstrate the efficacy of our approach and reveal that often only a fraction of the full-chain budget is required.
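
The abstract does not spell out how the E-values are constructed or how the test outcome is mapped to a stopping decision, so the following is only a minimal sketch of the generic mechanics of a sequential anytime-valid test. It assumes a simple betting-style e-process with a fixed bet size, and it assumes each observation is a hypothetical per-step improvement score (for example, a bounded validation-metric difference between the sampled ensemble and the DE baseline) scaled to [-1, 1]. The function name, the `bet` parameter, and the score definition are illustrative assumptions, not the authors' method.

```python
from typing import Iterable, Optional, Tuple


def monitor_e_process(
    improvements: Iterable[float],
    alpha: float = 0.05,
    bet: float = 0.5,
) -> Tuple[Optional[int], float]:
    """Monitor a betting-style e-process for H0: E[x_t | past] <= 0.

    Assumptions (illustrative, not from the paper):
      * each x_t is an improvement score scaled to [-1, 1];
      * a fixed bet size in [0, 1) is used at every step.

    Returns (t, wealth) where t is the first step at which the wealth
    reaches 1/alpha, or None if the threshold is never reached.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if not 0.0 <= bet < 1.0:
        raise ValueError("bet must lie in [0, 1) so the wealth stays positive")

    wealth = 1.0
    threshold = 1.0 / alpha
    for t, x in enumerate(improvements, start=1):
        if not -1.0 <= x <= 1.0:
            raise ValueError("improvement scores must be scaled to [-1, 1]")
        # Under H0 the factor (1 + bet * x) has conditional mean <= 1,
        # so the running product is a nonnegative supermartingale.
        wealth *= 1.0 + bet * x
        # Ville's inequality: under H0 this crossing happens with
        # probability at most alpha, no matter when we look.
        if wealth >= threshold:
            return t, wealth
    return None, wealth


if __name__ == "__main__":
    # Deterministic toy stream: a constant improvement of 0.5 per step.
    step, wealth = monitor_e_process([0.5] * 20)
    print(step, round(wealth, 3))
```

With a constant score of 0.5 and a bet of 0.5, the wealth grows by a factor of 1.25 per step and first reaches the threshold of 20 at step 14. A fixed bet is the simplest choice; adaptive bet sizes are common in practice and remain valid as long as each bet depends only on past observations.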