When should we trust the annotation? Selective prediction for molecular structure retrieval from mass spectra

arXiv cs.LG / 3/12/2026

Key Points

  • The paper introduces a selective prediction framework for molecular structure retrieval from MS/MS spectra that allows models to abstain from predictions when uncertainty is too high.
  • It evaluates uncertainty quantification strategies at two granularity levels: fingerprint-level uncertainty over predicted molecular fingerprint bits and retrieval-level uncertainty over candidate rankings.
  • The study compares scoring functions including first-order confidence measures, aleatoric and epistemic uncertainty from second-order distributions, and distance-based measures in the latent space.
  • It finds that fingerprint-level uncertainty scores are poor proxies for retrieval success, whereas retrieval-level aleatoric uncertainty and simple first-order confidence measures yield strong risk-coverage tradeoffs across evaluation settings.
  • It shows that distribution-free risk control via generalization bounds lets practitioners specify a tolerable error rate and obtain, with high probability, a subset of annotations satisfying that constraint.
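The risk-coverage tradeoff underlying these key points can be illustrated with a minimal sketch. The function and the toy data below are hypothetical, not from the paper: we assume a confidence score per spectrum annotation and sweep an abstention threshold, measuring coverage (fraction of spectra annotated) against selective risk (error rate among annotated spectra).

```python
# Hedged sketch of a risk-coverage curve for selective prediction.
# `confidences` and `correct` stand in for a retrieval model's per-spectrum
# confidence scores and whether its top-ranked candidate was right.

def risk_coverage_curve(confidences, correct):
    """Sweep abstention thresholds from strictest to loosest and return
    (coverage, risk) pairs: coverage is the fraction of spectra annotated,
    risk is the error rate within that annotated subset."""
    # Accept predictions in order of decreasing confidence.
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve, errors = [], 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        curve.append((k / len(order), errors / k))
    return curve

# Toy example: confidence roughly tracks correctness, so risk rises
# as coverage grows and low-confidence predictions are admitted.
conf = [0.95, 0.90, 0.80, 0.60, 0.55, 0.30]
ok = [True, True, True, False, True, False]
for coverage, risk in risk_coverage_curve(conf, ok):
    print(f"coverage={coverage:.2f}  risk={risk:.2f}")
```

A "strong" uncertainty score in this framing is one whose curve keeps risk low until coverage is high; a score that is a poor proxy for retrieval success produces a flat or erratic curve.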

Abstract

Machine learning methods for identifying molecular structures from tandem mass spectra (MS/MS) have advanced rapidly, yet current approaches still exhibit significant error rates. In high-stakes applications such as clinical metabolomics and environmental screening, incorrect annotations can have serious consequences, making it essential to determine when a prediction can be trusted. We introduce a selective prediction framework for molecular structure retrieval from MS/MS spectra, enabling models to abstain from predictions when uncertainty is too high. We formulate the problem within the risk-coverage tradeoff framework and comprehensively evaluate uncertainty quantification strategies at two levels of granularity: fingerprint-level uncertainty over predicted molecular fingerprint bits, and retrieval-level uncertainty over candidate rankings. We compare scoring functions including first-order confidence measures, aleatoric and epistemic uncertainty estimates from second-order distributions, as well as distance-based measures in the latent space. All experiments are conducted on the MassSpecGym benchmark. Our analysis reveals that while fingerprint-level uncertainty scores are poor proxies for retrieval success, computationally inexpensive first-order confidence measures and retrieval-level aleatoric uncertainty achieve strong risk-coverage tradeoffs across evaluation settings. We demonstrate that by applying distribution-free risk control via generalization bounds, practitioners can specify a tolerable error rate and obtain a subset of annotations satisfying that constraint with high probability.
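The distribution-free risk control mentioned at the end of the abstract can be sketched as follows. This is an illustration, not the paper's exact procedure: we assume a held-out calibration set of (confidence, correct) pairs and use a simple Hoeffding-style upper confidence bound to certify, with probability at least 1 − δ, that the error rate above a chosen threshold stays below the tolerance α. The paper's actual generalization bound may differ.

```python
import math

def hoeffding_ucb(errors, n, delta):
    """One-sided Hoeffding upper bound on the true error rate,
    given `errors` mistakes in `n` calibration points."""
    return errors / n + math.sqrt(math.log(1 / delta) / (2 * n))

def calibrate_threshold(confidences, correct, alpha, delta):
    """Return the smallest confidence threshold (i.e. maximal coverage)
    whose accepted calibration subset has a certified error rate <= alpha,
    or None if no threshold can be certified."""
    for t in sorted(set(confidences)):
        accepted = [ok for c, ok in zip(confidences, correct) if c >= t]
        if not accepted:
            continue
        errs = sum(1 for ok in accepted if not ok)
        if hoeffding_ucb(errs, len(accepted), delta) <= alpha:
            return t
    return None

# Toy calibration set: 100 high-confidence annotations, all correct,
# plus 100 low-confidence ones of which 40 are wrong.
conf = [0.9] * 100 + [0.5] * 100
ok = [True] * 100 + [False] * 40 + [True] * 60
t = calibrate_threshold(conf, ok, alpha=0.15, delta=0.1)
print(t)  # 0.9: only the high-confidence subset can be certified
```

At deployment, predictions with confidence below the calibrated threshold are abstained on; the remaining annotations satisfy the error-rate constraint with high probability, which is exactly the "trusted subset" guarantee the abstract describes.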