CompleteRXN: Toward Completing Open Chemical Reaction Databases

arXiv cs.LG / 5/4/2026


Key Points

  • Existing chemical reaction datasets like USPTO are significantly incomplete, often missing byproducts, co-reactants, and stoichiometric information, which undermines downstream reliability.
  • The paper introduces CompleteRXN, a large-scale supervised benchmark for reaction completion under realistic missing-data conditions. It is built by mapping USPTO records to curated mechanistic reactions, yielding aligned pairs of incomplete and atom-balanced reactions.
  • The evaluation covers representative baselines, including the Constrained Reaction Balancer (CRB), a new encoder-decoder reaction completion model with constrained decoding, and SynRBL, a recent algorithmic method. Performance degrades for all methods as incompleteness increases.
  • CRB achieves the strongest benchmark results, reaching 99.20% equivalence accuracy on a random split and 91.12% on an extreme out-of-distribution split.
  • On reactions outside the benchmark (the full, uncurated USPTO), performance drops substantially, exposing a gap between benchmark scores and practical robustness and motivating future work.
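
To make "atom-balanced" concrete, here is a minimal sketch of an atom-balance check. It is not the paper's pipeline: it assumes each species is given as a flat molecular formula (no parentheses, charges, or isotopes) paired with an integer stoichiometric coefficient, whereas a real workflow on USPTO would parse SMILES with a cheminformatics toolkit. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: flat formulas only, e.g. "C2H6O"; no parentheses or charges.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula: str) -> Counter:
    """Count atoms in a flat molecular formula such as 'C2H6O'."""
    counts = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def side_counts(side) -> Counter:
    """Sum atoms over (coefficient, formula) pairs on one side of a reaction."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in atom_counts(formula).items():
            total[element] += coefficient * n
    return total


def missing_atoms(reactants, products):
    """Return (atoms missing from products, atoms missing from reactants)."""
    left, right = side_counts(reactants), side_counts(products)
    # Counter subtraction keeps only positive differences.
    return left - right, right - left


def is_atom_balanced(reactants, products) -> bool:
    """True when both sides contain exactly the same atoms."""
    lacking_right, lacking_left = missing_atoms(reactants, products)
    return not lacking_right and not lacking_left


# Fischer esterification recorded without its water byproduct.
reactants = [(1, "C2H4O2"), (1, "C2H6O")]   # acetic acid + ethanol
incomplete = [(1, "C4H8O2")]                # ethyl acetate only
complete = incomplete + [(1, "H2O")]        # byproduct restored

print(is_atom_balanced(reactants, incomplete))  # False
print(is_atom_balanced(reactants, complete))    # True
```

For the incomplete record, `missing_atoms` reports two hydrogens and one oxygen missing from the product side, which is exactly the water molecule a completion method would need to supply.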

Abstract

Chemical reaction datasets such as USPTO suffer from substantial incompleteness, frequently missing byproducts, co-reactants, and stoichiometric coefficients. This limits their applicability and reliability in downstream applications. Here, we introduce CompleteRXN, a large-scale supervised benchmark for reaction completion under realistic missing-data conditions. We construct a dataset of aligned incomplete and atom-balanced reactions by mapping USPTO records to curated mechanistic reactions. We evaluate representative baselines, including a novel encoder-decoder reaction completion model with constrained decoding, the Constrained Reaction Balancer (CRB), and a recent algorithmic method, SynRBL. On our CompleteRXN benchmark, the CRB achieves high performance across splits of increasing difficulty, reaching 99.20% equivalence accuracy on the random split and 91.12% on the extreme out-of-distribution split. SynRBL produces many balanced and chemically plausible completions, but with lower accuracy on the benchmark test splits. Across all methods, performance degrades with increasing incompleteness. We observe a substantial drop when evaluating on reactions outside the benchmark (full uncurated USPTO), highlighting the gap between benchmark performance and practical robustness and motivating future work.
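
The abstract does not define equivalence accuracy, so the following is only one plausible reading, sketched under stated assumptions: a predicted completion counts as correct when it contains the same multiset of species on each side as the reference, regardless of order. The sketch assumes reaction strings of the form `reactants>>products` with `.` separating species, that every species string is already canonicalized, and that a stoichiometric coefficient is expressed by repeating the species. All function names are hypothetical.

```python
from collections import Counter


def _species(side: str) -> Counter:
    """Order-independent multiset of the species on one side."""
    return Counter(s for s in side.split(".") if s)


def parse_reaction(rxn: str):
    """Split 'A.B>>C.D' into (reactant multiset, product multiset)."""
    left, _, right = rxn.partition(">>")
    return _species(left), _species(right)


def equivalent(predicted: str, reference: str) -> bool:
    """True when both reactions list the same species with the same counts."""
    return parse_reaction(predicted) == parse_reaction(reference)


def equivalence_accuracy(predictions, references) -> float:
    """Fraction of predictions equivalent to their reference reaction."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(equivalent(p, r) for p, r in zip(predictions, references))
    return hits / len(references)


references = [
    "CC(=O)O.CCO>>CCOC(C)=O.O",   # esterification with water byproduct
    "CC(=O)O.CCO>>CCOC(C)=O.O",
]
predictions = [
    "CCO.CC(=O)O>>O.CCOC(C)=O",   # same species, different order
    "CC(=O)O.CCO>>CCOC(C)=O",     # water still missing
]
print(equivalence_accuracy(predictions, references))  # 0.5
```

Comparing multisets rather than raw strings keeps the metric insensitive to species order while still penalizing a missing byproduct or a wrong coefficient.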