Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

arXiv cs.AI / 4/20/2026

📰 NewsIdeas & Deep AnalysisTools & Practical UsageModels & Research

共有:

Key Points

The paper introduces “Hard Mode” automated theorem proving, where an agent must first independently discover the answer before constructing a formal Lean 4 proof, rather than assuming the final result is embedded in the statement.
It releases Hard Mode benchmark variants (MiniF2F-Hard and FIMO-Hard) that have been expert re-annotated to support more realistic ATP evaluation.
It proposes the Discover And Prove (DAP) agentic framework that uses LLM natural-language reasoning with explicit self-reflection to find candidate answers, then rewrites Hard Mode problems into “Easy Mode” forms for existing ATP provers.
DAP achieves new state-of-the-art results by raising solved problems on CombiBench from 7 to 10 (Pass@16) and by being the first to formally prove 36 Putnam theorems in Hard Mode.
The authors also report a large performance gap: top LLMs exceed 80% accuracy on Hard Mode problems while formal provers manage under 10%, suggesting Hard Mode benchmarks better expose limitations relevant to real proof discovery.

Abstract

Most ATP benchmarks embed the final answer within the formal statement -- a convention we call "Easy Mode" -- a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode -- while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are uniquely suited to measure.

Black Hat USA

AI Business

Black Hat Asia

AI Business

Which Version of Qwen 3.6 for M5 Pro 24g

Reddit r/LocalLLaMA

From Theory to Reality: Why Most AI Agent Projects Fail (And How Mine Did Too)

Dev.to

GPT-5.4-Cyber: OpenAI's Game-Changer for AI Security and Defensive AI

Dev.to

Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

Key Points

Abstract

Related Articles

Black Hat USA

Black Hat Asia

Which Version of Qwen 3.6 for M5 Pro 24g

From Theory to Reality: Why Most AI Agent Projects Fail (And How Mine Did Too)

GPT-5.4-Cyber: OpenAI's Game-Changer for AI Security and Defensive AI

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer