VOLTA: The Surprising Ineffectiveness of Auxiliary Losses for Calibrated Deep Learning

arXiv cs.AI / 4/13/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper benchmarks ten common uncertainty quantification (UQ) approaches across in-distribution, corruption shifts, and out-of-distribution scenarios, highlighting the lack of a universally best method across modalities and distribution shifts.
  • It proposes a simplified, highly effective variant of VOLTA that uses only a deep encoder, learnable prototypes, cross-entropy loss, and post-hoc temperature scaling rather than more complex auxiliary-loss designs.
  • Across evaluated datasets (CIFAR-10/100, SVHN, uniform noise, CIFAR-10C, and Tiny ImageNet features), VOLTA achieves competitive-to-superior accuracy while substantially reducing expected calibration error versus the baseline range.
  • VOLTA also demonstrates solid out-of-distribution detection performance (reported AUROC), supported by statistical testing across three random seeds and ablation studies emphasizing adaptive temperature and the deep encoder.
  • Overall, the results position VOLTA as a lightweight, deterministic, and well-calibrated alternative to more complex UQ pipelines for safety-critical deployment settings.
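The simplified VOLTA head summarized above — a deep encoder feeding learnable class prototypes, trained with plain cross-entropy and calibrated post hoc with temperature scaling — can be sketched in a few lines. The negative-squared-distance logit form, the parameter names, and the toy dimensions below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def prototype_logits(z, prototypes):
    """Logits for an encoded feature vector z of shape (D,) against
    learnable class prototypes of shape (K, D), taken here as the
    negative squared Euclidean distance (an assumed choice; a
    dot-product head would also match the high-level description)."""
    return -np.sum((prototypes - z) ** 2, axis=1)

def temperature_softmax(logits, T=1.0):
    """Post-hoc temperature scaling: divide logits by a scalar T
    fitted on held-out data before the softmax. T > 1 softens the
    distribution and typically reduces overconfidence."""
    scaled = logits / T
    scaled = scaled - scaled.max()  # subtract max for numerical stability
    e = np.exp(scaled)
    return e / e.sum()

# Toy usage: 3 classes, 4-dim encoder features.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(3, 4))
z = prototypes[1] + 0.1 * rng.normal(size=4)  # feature near class-1 prototype
probs = temperature_softmax(prototype_logits(z, prototypes), T=1.5)
```

At inference the whole pipeline stays deterministic — one forward pass, one distance computation, one scaled softmax — which is what makes this family of methods cheap compared to sampling-based UQ such as MC Dropout or ensembles.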

Abstract

Uncertainty quantification (UQ) is essential for deploying deep learning models in safety-critical applications, yet no consensus exists on which UQ method performs best across data modalities and distribution shifts. This paper presents a comprehensive benchmark of ten widely used UQ baselines (MC Dropout, SWAG, ensemble methods, temperature scaling, energy-based OOD detection, Mahalanobis distance, hyperbolic classifiers, ENN, Taylor Sensus, and split conformal prediction) against a simplified yet highly effective variant of VOLTA that retains only a deep encoder, learnable prototypes, cross-entropy loss, and post-hoc temperature scaling. We evaluate all methods on CIFAR-10 (in-distribution), CIFAR-100, SVHN, and uniform noise (out-of-distribution), CIFAR-10-C (corruptions), and Tiny ImageNet features (tabular). VOLTA achieves competitive or superior accuracy (up to 0.864 on CIFAR-10), significantly lower expected calibration error (0.010 vs. 0.044–0.102 for baselines), and strong OOD detection (AUROC 0.802). Statistical testing over three random seeds shows that VOLTA matches or outperforms most baselines, and ablation studies confirm the importance of adaptive temperature and deep encoders. Our results establish VOLTA as a lightweight, deterministic, and well-calibrated alternative to more complex UQ approaches.
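Expected calibration error, the metric behind the reported 0.010 vs. 0.044–0.102 gap, has a standard binned definition: partition predictions by confidence and average the per-bin gap between accuracy and mean confidence, weighted by bin mass. A minimal sketch (the bin count and equal-width binning are common defaults, assumed here since the paper's exact settings are not given in this summary):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - confidence|.

    confidences : array of predicted top-class probabilities in [0, 1]
    correct     : array of 0/1 indicators (prediction matched the label)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()    # empirical accuracy in this bin
            conf = confidences[mask].mean()  # mean confidence in this bin
            ece += mask.mean() * abs(acc - conf)
    return ece

# Perfectly calibrated toy case: 90% confidence, 90% accurate -> ECE = 0.
conf = np.full(10, 0.9)
corr = np.array([1] * 9 + [0])
ece = expected_calibration_error(conf, corr)
```

Lower is better: an ECE of 0.010 means the model's stated confidence tracks its empirical accuracy to within about one percentage point on average, which is the sense in which VOLTA is "well calibrated" relative to the 0.044–0.102 baseline range.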