Knowing when to trust machine-learned interatomic potentials

arXiv cs.LG / 5/4/2026

Key Points

  • The paper argues that current ensemble-based uncertainty quantification methods for machine-learned interatomic potentials (MLIPs) scale poorly to foundation-scale models, and that ensemble disagreement is only weakly correlated with true per-molecule prediction error.
  • It proposes PROBE (Post-hoc Reliability frOm Backbone Embeddings), which turns uncertainty estimation into a post-hoc selective classification problem using a compact classifier on frozen per-atom representations from a pretrained MLIP.
  • PROBE outputs a per-prediction reliability probability that monotonically tracks actual prediction error without changing the underlying MLIP.
  • Evaluations on large held-out sets across two structurally different MLIP architectures show PROBE outperforms ensemble disagreement as a binary reliability signal, with stronger performance as the backbone representation becomes more expressive.
  • The approach is post-hoc, architecture-agnostic, directly deployable on any MLIP exposing per-atom representations, and it can produce chemically interpretable per-atom importance maps via multi-head self-attention at no additional compute cost.
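To make the selective-classification framing in the points above concrete, here is a minimal sketch, not the paper's implementation. It assumes the backbone's per-atom embeddings are already available as arrays, mean-pools them per molecule, labels a prediction "reliable" when its absolute error is within a tolerance `tau`, and fits a plain logistic-regression probe. The names `pool_embeddings`, `reliability_labels`, `fit_probe`, and `tau`, as well as the synthetic data, are illustrative assumptions; the paper's classifier and threshold choice may differ.

```python
import numpy as np

def pool_embeddings(per_atom):
    """Mean-pool an (n_atoms, d) array of frozen per-atom embeddings into a (d,) vector."""
    return per_atom.mean(axis=0)

def reliability_labels(abs_errors, tau):
    """Binary targets: 1.0 if the backbone's absolute error is within tau, else 0.0."""
    return (np.asarray(abs_errors) <= tau).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_probe(X, y, lr=0.5, steps=2000, l2=1e-3):
    """Fit a logistic-regression probe by gradient descent; the backbone is never updated."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y) / n + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def predict_reliability(X, w, b):
    """Per-prediction probability that the backbone's output is within tolerance."""
    return sigmoid(X @ w + b)

# Synthetic stand-in for frozen embeddings (assumption: no real MLIP is called here).
rng = np.random.default_rng(0)
n_mol, d = 400, 8
molecules = [rng.normal(size=(rng.integers(3, 12), d)) for _ in range(n_mol)]
X = np.stack([pool_embeddings(m) for m in molecules])

# Hypothetical error model: error grows with the first pooled coordinate, plus noise.
abs_err = 0.1 * np.exp(X[:, 0]) + 0.01 * rng.random(n_mol)
y = reliability_labels(abs_err, tau=np.median(abs_err))

w, b = fit_probe(X, y)
p = predict_reliability(X, w, b)

# Selective prediction: compare mean error of high- vs low-reliability predictions.
keep = p > np.median(p)
print("mean error, kept:    ", abs_err[keep].mean())
print("mean error, rejected:", abs_err[~keep].mean())
```

For brevity the probe is fit and scored on the same synthetic molecules; a real evaluation would score it on a held-out set, as the paper's does.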

Abstract

Prevailing machine-learned interatomic potential (MLIP) uncertainty-quantification methods rely on ensembles of independently trained backbones. These methods scale unfavorably with foundation-scale MLIPs, and their member-disagreement signals correlate weakly with per-molecule prediction error. Here we probe the frozen per-atom representations of a pretrained MLIP with a compact discriminative classifier, recasting MLIP uncertainty quantification as selective classification rather than error regression. The resulting method, PROBE (Post-hoc Reliability frOm Backbone Embeddings), produces a per-prediction reliability probability that monotonically tracks actual error without modification to the underlying model. Across large held-out evaluation sets and two structurally distinct MLIP architectures, PROBE outperforms ensemble disagreement as a binary reliability signal, and its performance strengthens with the expressiveness of the backbone representation, implying a favorable scaling trajectory toward foundation-scale MLIPs. Multi-head self-attention additionally yields per-atom importance maps, providing chemically interpretable diagnostics at no additional computational cost. PROBE is post-hoc and architecture-agnostic, and is directly deployable on any MLIP that exposes per-atom representations.
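As an illustration of how multi-head self-attention over atoms can yield a per-atom importance map as a by-product, the sketch below applies one attention layer to an array of per-atom embeddings and reads the importance of each atom off the attention weights. It rests on assumptions throughout: the function name `attention_pool`, the random projection matrices, and the choice to define importance as attention received, averaged over heads and query atoms, are illustrative. The abstract does not specify how PROBE derives its maps.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(H, Wq, Wk, Wv, n_heads):
    """One multi-head self-attention layer over atoms, followed by mean pooling.

    H: (n_atoms, d) frozen per-atom embeddings; Wq, Wk, Wv: (d, d) projections.
    Returns a pooled (d,) molecule vector and an (n_atoms,) importance map.
    """
    n, d = H.shape
    dh = d // n_heads

    def split(M):
        # (n, d) -> (heads, n, dh)
        return M.reshape(n, n_heads, dh).transpose(1, 0, 2)

    Q, K, V = split(H @ Wq), split(H @ Wk), split(H @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dh)   # (heads, n, n)
    A = softmax(scores, axis=-1)                      # each row sums to 1
    out = (A @ V).transpose(1, 0, 2).reshape(n, d)    # merge heads back to (n, d)
    pooled = out.mean(axis=0)
    # Assumed definition: attention each atom receives, averaged over heads and queries.
    importance = A.mean(axis=(0, 1))
    return pooled, importance

# Hypothetical 5-atom molecule with 8-dimensional embeddings and 2 heads.
rng = np.random.default_rng(1)
n_atoms, d, n_heads = 5, 8, 2
H = rng.normal(size=(n_atoms, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

pooled, importance = attention_pool(H, Wq, Wk, Wv, n_heads)
print(importance)
```

Because the attention weights are already computed in the forward pass, reading them out adds no extra model evaluations, which is consistent with the "no additional computational cost" claim. The pooled vector could feed a classifier head such as a logistic layer.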