Riemann-Bench: A Benchmark for Moonshot Mathematics

arXiv cs.AI / 4/10/2026


Key Points

  • The paper introduces Riemann-Bench, a private benchmark with 25 expert-curated, research-level mathematics problems intended to go beyond International Mathematical Olympiad (IMO) competition skills.
  • Problems are created by Ivy League mathematics researchers and IMO medalists, typically taking authors weeks to solve, and are validated via double-blind independent expert verification with programmatic verifiers for unique closed-form solutions.
  • The authors test frontier AI models as unconstrained research agents with access to coding tools and search, evaluating performance with an unbiased estimator computed over 100 independent runs per problem (see the sketch after this list).
  • Reported results show all evaluated frontier models score below 10%, highlighting a large gap between olympiad-style problem solving and true research-level mathematical reasoning.
  • The benchmark is kept fully private to reduce the likelihood of memorization from training data and to better reflect genuine mathematical capability.
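
The paper does not spell out the exact form of its estimator; for binary pass/fail grading over n independent runs per problem, the standard combinatorial pass@k formula is a common choice, and the sketch below assumes that setup with n = 100 as reported (function name and the example values are illustrative only).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n independent runs with c successes.

    Assumption: the paper reports only an "unbiased estimator over 100
    independent runs per problem"; the standard combinatorial pass@k
    formula is used here purely as an illustration.
    """
    if n - c < k:
        return 1.0  # every size-k subset of runs contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 correct answers out of 100 runs on one problem.
print(pass_at_k(n=100, c=3, k=1))   # 0.03
print(pass_at_k(n=100, c=3, k=10))  # ≈ 0.273
```

Averaging this per-problem quantity across the 25 problems would then give a benchmark-level score.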

Abstract

Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad, demonstrating remarkable proficiency at competition-style problem solving. However, competition mathematics represents only a narrow slice of mathematical reasoning: problems are drawn from limited domains, require minimal advanced machinery, and can often reward insightful tricks over deep theoretical knowledge. We introduce Riemann-Bench, a private benchmark of 25 expert-curated problems designed to evaluate AI systems on research-level mathematics that goes far beyond the olympiad frontier. Problems are authored by Ivy League mathematics professors, graduate students, and PhD-holding IMO medalists, and routinely took their authors weeks to solve independently. Each problem undergoes double-blind verification by two independent domain experts who must solve the problem from scratch, and yields a unique, closed-form solution assessed by programmatic verifiers. We evaluate frontier models as unconstrained research agents, with full access to coding tools, search, and open-ended reasoning, using an unbiased statistical estimator computed over 100 independent runs per problem. Our results reveal that all frontier models currently score below 10%, exposing a substantial gap between olympiad-level problem solving and genuine research-level mathematical reasoning. By keeping the benchmark fully private, we ensure that measured performance reflects authentic mathematical capability rather than memorization of training data.
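
The abstract notes that each problem has a unique closed-form answer checked by programmatic verifiers, but the verifiers themselves are not published. A minimal sketch of one way such a check could work, assuming answers are submitted as single symbolic expressions and using SymPy equivalence testing (the function name and example answers are hypothetical):

```python
import sympy as sp

def verify_closed_form(submitted: str, reference: str) -> bool:
    """Return True if a submitted closed-form answer matches the reference.

    Hypothetical sketch: the benchmark's verifiers are private; here we
    assume both answers parse as single SymPy expressions and test
    equivalence by simplifying their difference to zero.
    """
    try:
        diff = sp.simplify(sp.sympify(submitted) - sp.sympify(reference))
    except (sp.SympifyError, TypeError):
        return False
    return diff == 0

# Example: algebraically equivalent forms of the same constant are accepted.
print(verify_closed_form("pi**2/6", "zeta(2)"))      # True
print(verify_closed_form("sqrt(2)/2", "1/sqrt(2)"))  # True
print(verify_closed_form("exp(1)", "3"))             # False
```

A production verifier would also need to handle answer extraction from free-form model output, admissible alternative notations, and any numeric tolerance the problem format allows.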