WybeCoder: Verified Imperative Code Generation

arXiv cs.AI / 4/1/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

本論文は、ソフトウェア検証の改善がLLMによるコード生成や定理証明ほど進んでいない点を背景に、コード生成と不変条件・証明を同時に育てる「prove-as-you-generate」型のエージェント的検証フレームワークWybeCoderを提案しています。
WybeCoderは、検証条件生成とSMTソルバに加え、Leanでの対話型証明を組み合わせており、コード・不変条件・証明が相互に更新される“co-evolve”を中核に据えています。
検証の体系的評価のため、Leanベースの2つの機能的検証ベンチマーク（VerinaとClever）を、等価な命令型（imperative）コード仕様へ翻訳することで比較可能な実験環境を整えています。
Heapsortのような複雑アルゴリズムで、スケールさせることで有効な不変条件やサブゴールを多数生成し、従来の到達率の停滞（plateau）を超えて、目標タスクの達成率（Verina 74%、Clever 62%）を中程度の計算予算で大幅に向上させたと報告しています。

Abstract

Recent progress in large language models (LLMs) has advanced automatic code generation and formal theorem proving, yet software verification has not seen the same improvement. To address this gap, we propose WybeCoder, an agentic code verification framework that enables prove-as-you-generate development where code, invariants, and proofs co-evolve. It builds on a recent framework that combines automatic verification condition generation and SMT solvers with interactive proofs in Lean. To enable systematic evaluation, we translate two benchmarks for functional verification in Lean, Verina and Clever, to equivalent imperative code specifications. On complex algorithms such as Heapsort, we observe consistent performance improvements by scaling our approach, synthesizing dozens of valid invariants and dispatching of dozens of subgoals, resulting in hundreds of lines of verified code, overcoming plateaus reported in previous works. Our best system solves 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, significantly surpassing previous evaluations and paving a path to automated construction of large-scale datasets of verified imperative code.