Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention

arXiv cs.LG / 4/22/2026


Key Points

  • The paper introduces “Stochastic Attention,” an inference-time modification for transformer-based scientific foundation models that randomizes attention to better support calibrated predictive uncertainty.
  • Instead of deterministic softmax attention weights, it uses normalized multinomial sampling governed by a single concentration parameter, enabling predictive ensembles without retraining.
  • The authors propose a calibration objective to set the concentration parameter via an efficient univariate post-hoc tuning process that aligns stochastic outputs with targets.
  • Experiments on weather and time-series forecasting foundation models, plus an additional regression task, show improved calibration and sharper prediction intervals compared with uncertainty-aware baselines.
  • The approach is computationally efficient, needing only minutes of post-hoc tuning to reach competitive performance, versus days of retraining for comparable baseline methods.
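The sampling mechanism in the second bullet can be illustrated with a small sketch. This is a hypothetical reading of the paper's description, not its actual implementation: attention weights are drawn as multinomial counts over the softmax distribution and normalized, with the concentration parameter (here `concentration`, the multinomial trial count) controlling how closely each draw tracks the deterministic softmax.

```python
import numpy as np

def stochastic_attention_weights(scores, concentration, rng):
    """Replace deterministic softmax attention weights with
    normalized multinomial samples.

    Hypothetical sketch: draw `concentration` trials from the
    softmax distribution and normalize the counts. Larger values
    of `concentration` give weights closer to the softmax; smaller
    values give noisier, more diverse weights.
    """
    p = np.exp(scores - scores.max())  # stable softmax
    p /= p.sum()
    counts = rng.multinomial(concentration, p)
    return counts / concentration      # weights sum to 1

# Repeated forward passes with fresh draws form a predictive
# ensemble without any retraining of the base model.
rng = np.random.default_rng(0)
scores = np.array([2.0, 1.0, 0.1])
ensemble = np.stack(
    [stochastic_attention_weights(scores, 100, rng) for _ in range(8)]
)
```

Each row of `ensemble` is a valid attention distribution, so the same pretrained weights can be reused while the sampling injects the randomness needed for uncertainty estimates.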

Abstract

Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures produce deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic Attention, a lightweight inference-time modification that randomizes attention by replacing softmax weights with normalized multinomial samples controlled by a single concentration parameter, and produces predictive ensembles without retraining. To set this parameter, we introduce a calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem. We evaluate this mechanism on two scientific foundation models for weather and time-series forecasting, along with an additional regression task. Across benchmarks against uncertainty-aware baselines, we find that Stochastic Attention achieves the strongest native calibration and the sharpest prediction intervals at comparable coverage, while requiring only minutes of post-hoc tuning versus days of retraining for competitive baselines.
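Because only one scalar is tuned, the post-hoc calibration step reduces to a univariate search. The sketch below illustrates this with a toy stand-in for the model; the objective (Gaussian negative log-likelihood of held-out targets under the ensemble's mean and spread) and the noise model are assumptions for illustration, not the paper's actual objective.

```python
import numpy as np

rng = np.random.default_rng(1)

def ensemble_predict(x, concentration, n_members=32):
    """Toy stand-in for a model with stochastic attention:
    predictive noise shrinks as the concentration grows
    (a hypothetical relationship used only for this demo)."""
    noise = rng.normal(0.0, 1.0 / np.sqrt(concentration),
                       size=(n_members, x.size))
    return x[None, :] + noise  # shape: (members, samples)

def calibration_objective(preds, targets):
    """Gaussian NLL of targets under the ensemble mean/std --
    one plausible way to align stochastic outputs with targets."""
    mu, sigma = preds.mean(axis=0), preds.std(axis=0) + 1e-6
    return float(np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                         + (targets - mu)**2 / (2 * sigma**2)))

# Univariate post-hoc tuning: grid search over the single
# concentration parameter on a held-out split.
x_val = rng.normal(size=200)
y_val = x_val + rng.normal(0.0, 0.1, size=200)  # true noise scale 0.1
grid = [10, 50, 100, 200, 500, 1000]
best = min(grid,
           key=lambda k: calibration_objective(ensemble_predict(x_val, k),
                                               y_val))
```

Since the search is one-dimensional, a coarse grid (or any scalar optimizer) suffices, which is consistent with the paper's claim that tuning takes minutes rather than the days needed to retrain an uncertainty-aware baseline.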