FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?

arXiv cs.AI / 3/31/2026


Key Points

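FormalProofBench is a private benchmark that pairs natural-language math problems with Lean 4 formal statements and evaluates whether models can produce formal proofs accepted by the Lean 4 checker.
It covers a broad range of topics, including analysis, algebra, probability, and logic, with advanced undergraduate to graduate-level problems drawn from qualifying exams and textbooks.
Several frontier foundation models were evaluated with an agentic harness; the best-performing model reached only 33.5% accuracy, and performance reportedly dropped rapidly after that.
Beyond accuracy, the paper analyzes tool use, failure modes, cost, and latency, giving a comprehensive picture of current capabilities and limits in formal theorem proving.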

Abstract

We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean 4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced undergraduate and graduate mathematics, with problems drawn from qualifying exams and standard textbooks across topics including analysis, algebra, probability, and logic. We evaluate a range of frontier models with an agentic harness, and find that the best-performing foundation model achieves 33.5% accuracy, with performance dropping rapidly after that. In addition to the accuracy numbers, we provide empirical analysis of tool use, failure modes, cost, and latency, thereby providing a thorough evaluation of the formal theorem-proving abilities of frontier models.
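
To make the task format concrete, here is a minimal sketch of what a problem paired with a Lean 4 statement might look like. It is a hypothetical toy example and not taken from FormalProofBench, which is private and targets far harder, graduate-level material. The theorem names `task_statement` and `task_solution` are assumptions chosen for illustration. The natural-language problem is "Show that addition of natural numbers is commutative."

```lean
-- Hypothetical illustration only; not an actual FormalProofBench task.

-- The formal statement as a model might receive it, with the proof left open.
-- `sorry` is a placeholder: Lean elaborates the file but warns that the
-- proof is incomplete.
theorem task_statement (a b : Nat) : a + b = b + a := by
  sorry

-- A completed proof of the same statement, using the core lemma `Nat.add_comm`.
-- Lean accepts this one with no `sorry` warning.
theorem task_solution (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

The point of the format is that correctness is decided mechanically: a proof either type-checks against the given statement or it does not, so no human grading of the argument is needed.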

