Riemann-Bench: A Benchmark for Moonshot Mathematics
arXiv cs.AI / 4/10/2026
Key Points
- The paper introduces Riemann-Bench, a private benchmark of 25 expert-curated, research-level mathematics problems intended to go beyond the skills tested in International Mathematical Olympiad (IMO) competitions.
- Problems are created by Ivy League mathematics researchers and IMO medalists and typically take the authors themselves weeks to solve; they are validated through double-blind, independent expert verification, with programmatic verifiers used for problems that admit a unique closed-form solution (see the verifier sketch after this list).
- The authors evaluate frontier AI models as unconstrained research agents with access to coding tools and search, measuring performance with an unbiased estimator computed over 100 independent runs per problem (see the estimator sketch after this list).
- Reported results show all evaluated frontier models score below 10%, highlighting a large gap between olympiad-style problem solving and true research-level mathematical reasoning.
- The benchmark is kept fully private to reduce the likelihood of memorization from training data and to better reflect genuine mathematical capability.
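
The paper does not publish its verification code, but for problems with a unique closed-form answer, a programmatic check plausibly reduces to parsing the model's final expression and testing symbolic equality against the reference. Below is a minimal sketch of that idea using SymPy; the function name and the choice of SymPy are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a programmatic verifier for problems with a unique
# closed-form answer. This is an assumption about how such a check could
# work, not the benchmark's actual verifier.
import sympy as sp


def verify_closed_form(model_answer: str, reference_answer: str) -> bool:
    """Return True if the model's answer is symbolically equal to the reference."""
    try:
        candidate = sp.sympify(model_answer)
        reference = sp.sympify(reference_answer)
    except (sp.SympifyError, SyntaxError):
        return False  # unparsable output counts as a failure
    # simplify(candidate - reference) == 0 accepts equivalent but
    # differently written forms, e.g. "pi + pi" vs "2*pi".
    return sp.simplify(candidate - reference) == 0


if __name__ == "__main__":
    print(verify_closed_form("pi + pi", "2*pi"))       # True
    print(verify_closed_form("sqrt(8)", "2*sqrt(2)"))  # True
    print(verify_closed_form("3/7", "0.42"))           # False
```

In practice a real verifier would also have to extract the model's final answer from free-form agent output (e.g., a boxed expression) before this comparison step.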
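The "unbiased estimator over 100 independent runs per problem" most likely refers to the standard unbiased pass@k estimator of Chen et al. (2021), though the paper may define its metric differently; the sketch below shows that estimator under this assumption.

```python
# Standard unbiased pass@k estimator (Chen et al., 2021): given n runs on a
# problem, c of which are correct, estimate the probability that at least one
# of k sampled runs is correct. The use of this exact metric is an assumption.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n total runs with c correct runs."""
    if n - c < k:
        return 1.0  # every size-k sample must contain at least one correct run
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # e.g. 100 runs on one problem, 3 of which reached the verified answer
    print(f"pass@1  = {pass_at_k(100, 3, 1):.3f}")   # 0.030
    print(f"pass@10 = {pass_at_k(100, 3, 10):.3f}")
```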