Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

arXiv cs.AI / 3/23/2026


Key Points

  • The paper proposes a hierarchical proof search framework for automated code verification in Lean 4 that decomposes complex verification goals into simpler subgoals before tactic-level proving.
  • It introduces a principled decomposition score that serves as both the training reward and the inference-time ranking criterion to align optimization with deployment.
  • Goedel-Code-Prover-8B is trained via supervised initialization followed by hybrid reinforcement learning, using a single unified policy for both decomposition and completion.
  • On three Lean-based benchmarks totaling 427 tasks, the model achieves 62.0% prove success, a 2.6× improvement over the strongest baseline, and demonstrates scalable, efficient inference compared to larger neural provers.

Abstract

Large language models (LLMs) can generate plausible code but offer limited guarantees of correctness. Formally verifying that implementations satisfy specifications requires constructing machine-checkable proofs, a task that remains beyond current automation. We propose a hierarchical proof search framework for automated code verification in Lean 4 that decomposes complex verification goals into structurally simpler subgoals before attempting tactic-level proving. Central to our approach is a principled decomposition score that combines constructive justification with structural effectiveness. Crucially, this score serves as both the training reward and the inference-time ranking criterion, ensuring strict alignment between optimization and deployment. We train Goedel-Code-Prover-8B, a single unified policy for both decomposition and completion, via supervised initialization followed by hybrid reinforcement learning, where a continuous decomposition reward drives planning exploration while supervised replay stabilizes proof generation. On three Lean-based code verification benchmarks comprising 427 tasks, our 8B-parameter model achieves a 62.0% prove success rate, a 2.6× improvement over the strongest baseline, surpassing neural provers up to 84× larger. We further observe consistent inference-time scaling: success rates improve monotonically with search iterations and sampling budget, with our trained model achieving greater efficiency than frontier off-the-shelf models of comparable scale.
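To make the decomposition idea concrete, here is a minimal Lean 4 illustration in the spirit of the abstract (not taken from the paper): a conjunction-shaped verification goal for a small hypothetical function `myMax` is split into two structurally simpler subgoals, each of which is then closed by a short tactic proof.

```lean
-- Hypothetical illustration of goal decomposition for code verification.
-- `myMax` and its spec are invented for this sketch.
def myMax (a b : Nat) : Nat := if a ≤ b then b else a

-- The top-level goal is decomposed by `constructor` into two subgoals,
-- `a ≤ myMax a b` and `b ≤ myMax a b`, each finished at the tactic level.
theorem myMax_spec (a b : Nat) : a ≤ myMax a b ∧ b ≤ myMax a b := by
  constructor
  · unfold myMax; split <;> omega
  · unfold myMax; split <;> omega
```

In the paper's framework, the interesting question is which decompositions to attempt and in what order; the decomposition score is what ranks candidates like the `constructor` split above against alternatives.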