Riemann-Bench: A Benchmark for Moonshot Mathematics

arXiv cs.AI / 4/10/2026


Key Points

  • The paper introduces Riemann-Bench, a private benchmark with 25 expert-curated, research-level mathematics problems intended to go beyond International Mathematical Olympiad (IMO) competition skills.
  • Problems are created by Ivy League mathematics researchers and IMO medalists, typically taking authors weeks to solve, and are validated via double-blind independent expert verification with programmatic verifiers for unique closed-form solutions.
  • The authors test frontier AI models as unconstrained research agents with access to coding tools and search, evaluating performance with an unbiased estimator computed over 100 independent runs per problem (see the sketch after this list).
  • Reported results show all evaluated frontier models score below 10%, highlighting a large gap between olympiad-style problem solving and true research-level mathematical reasoning.
  • The benchmark is kept fully private to reduce the likelihood of memorization from training data and to better reflect genuine mathematical capability.
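
The paper does not spell out the exact form of its estimator; for binary pass/fail grading over n independent runs per problem, the standard combinatorial pass@k formula is a common choice, and the sketch below assumes that setup with n = 100 as reported (function name and the example values are illustrative only).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n independent runs with c successes.

    Assumption: the paper reports only an "unbiased estimator over 100
    independent runs per problem"; the standard combinatorial pass@k
    formula is used here purely as an illustration.
    """
    if n - c < k:
        return 1.0  # every size-k subset of runs contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 correct answers out of 100 runs on one problem.
print(pass_at_k(n=100, c=3, k=1))   # 0.03
print(pass_at_k(n=100, c=3, k=10))  # ≈ 0.273
```

Averaging this per-problem quantity across the 25 problems would then give a benchmark-level score.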

Abstract

Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad, demonstrating remarkable proficiency at competition-style problem solving. However, competition mathematics represents only a narrow slice of mathematical reasoning: problems are drawn from limited domains, require minimal advanced machinery, and can often reward insightful tricks over deep theoretical knowledge. We introduce Riemann-Bench, a private benchmark of 25 expert-curated problems designed to evaluate AI systems on research-level mathematics that goes far beyond the olympiad frontier. Problems are authored by Ivy League mathematics professors, graduate students, and PhD-holding IMO medalists, and routinely took their authors weeks to solve independently. Each problem undergoes double-blind verification by two independent domain experts who must solve the problem from scratch, and yields a unique, closed-form solution assessed by programmatic verifiers. We evaluate frontier models as unconstrained research agents, with full access to coding tools, search, and open-ended reasoning, using an unbiased statistical estimator computed over 100 independent runs per problem. Our results reveal that all frontier models currently score below 10%, exposing a substantial gap between olympiad-level problem solving and genuine research-level mathematical reasoning. By keeping the benchmark fully private, we ensure that measured performance reflects authentic mathematical capability rather than memorization of training data.
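
The abstract notes that each problem has a unique closed-form answer checked by programmatic verifiers, but the verifiers themselves are not published. A minimal sketch of one way such a check could work, assuming answers are submitted as single symbolic expressions and using SymPy equivalence testing (the function name and example answers are hypothetical):

```python
import sympy as sp

def verify_closed_form(submitted: str, reference: str) -> bool:
    """Return True if a submitted closed-form answer matches the reference.

    Hypothetical sketch: the benchmark's verifiers are private; here we
    assume both answers parse as single SymPy expressions and test
    equivalence by simplifying their difference to zero.
    """
    try:
        diff = sp.simplify(sp.sympify(submitted) - sp.sympify(reference))
    except (sp.SympifyError, TypeError):
        return False
    return diff == 0

# Example: algebraically equivalent forms of the same constant are accepted.
print(verify_closed_form("pi**2/6", "zeta(2)"))      # True
print(verify_closed_form("sqrt(2)/2", "1/sqrt(2)"))  # True
print(verify_closed_form("exp(1)", "3"))             # False
```

A production verifier would also need to handle answer extraction from free-form model output, admissible alternative notations, and any numeric tolerance the problem format allows.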