An Axiomatic Benchmark for Evaluation of Scientific Novelty Metrics

arXiv cs.AI / 4/17/2026

Key Points

  • The paper argues that reliably automating “scientific novelty” evaluation is difficult because ground-truth novelty is hard to define and current metrics often rely on noisy proxies like citations or peer-review scores.
  • It proposes an axiomatic benchmark, specifying principles (axioms) that a good novelty metric should satisfy based on human scientific norms and practice.
  • The authors test existing novelty metrics across ten tasks in three AI research domains and find that no single metric consistently satisfies all axioms.
  • They show that combining metrics with complementary architectures improves performance on the benchmark, and they release the benchmark code to support further research.
  • The results suggest that developing architecturally diverse novelty metrics is a promising path for building more trustworthy automated evaluation of scientific contributions.

Abstract

The rigorous evaluation of the novelty of a scientific paper is, even for human scientists, a challenging task. With the increasing interest in AI scientists and AI involvement in scientific idea generation and paper writing, it also becomes increasingly important that this task be automatable and reliable, lest both human attention and compute tokens be wasted on ideas that have already been explored. Due to the challenge of quantifying ground-truth novelty, however, existing novelty metrics for scientific papers generally validate their results against noisy, confounded signals such as citation counts or peer review scores. These proxies can conflate novelty with impact, quality, or reviewer preference, which in turn makes it harder to assess how well a given metric actually evaluates novelty. We therefore propose an axiomatic benchmark for scientific novelty metrics. We first define a set of axioms that a well-behaved novelty metric should satisfy, grounded in human scientific norms and practice, then evaluate existing metrics across ten tasks spanning three domains of AI research. Our results reveal that no existing metric satisfies all axioms consistently, and that metrics fail on systematically different axioms, reflecting their underlying architectures. Additionally, we show that combining metrics of complementary architectures leads to consistent improvements on the benchmark, with per-axiom weighting achieving 90.1% versus 71.5% for the best individual metric, suggesting that developing architecturally diverse metrics is a promising direction for future work. We release the benchmark code as supplementary material to encourage the development of more robust scientific literature novelty metrics.
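To make the "per-axiom weighting" idea concrete, here is a minimal sketch of how axiom-specific weights could blend several novelty metrics into one score. This is an illustrative reading of the abstract only, not the paper's actual implementation: the metric names, axiom names, and weight values below are hypothetical.

```python
# Illustrative sketch: per-axiom weighted combination of novelty metrics.
# All names and weights are hypothetical; the paper's exact scheme may differ.
from typing import Callable, Dict

Paper = str  # stand-in for whatever representation a metric consumes


def combine_per_axiom(
    metrics: Dict[str, Callable[[Paper], float]],
    weights: Dict[str, Dict[str, float]],
    axiom: str,
    paper: Paper,
) -> float:
    """Score a paper with axiom-specific weights over several metrics.

    weights[axiom][metric_name] expresses how much each metric is trusted
    on that axiom (e.g. a retrieval-based metric might be weighted higher
    on a paraphrase-invariance axiom, an LLM judge on idea recombination).
    """
    w = weights[axiom]
    total = sum(w.values())
    return sum(w[name] * metric(paper) for name, metric in metrics.items()) / total


# Hypothetical usage: two toy metrics returning placeholder scores in [0, 1].
metrics = {
    "embedding_distance": lambda paper: 0.62,
    "llm_judge": lambda paper: 0.80,
}
weights = {
    "paraphrase_invariance": {"embedding_distance": 0.7, "llm_judge": 0.3},
    "idea_recombination": {"embedding_distance": 0.2, "llm_judge": 0.8},
}

print(combine_per_axiom(metrics, weights, "paraphrase_invariance", "some paper text"))
```

The design point the abstract suggests is simply that different architectures fail on different axioms, so letting the weights vary per axiom (rather than using one global mixture) is what lifts the reported benchmark score from 71.5% to 90.1%.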