From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification

arXiv cs.AI / 4/27/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that LLMs can aid software engineering, but their generated code often fails correctness requirements due to errors and hallucinations.
  • It introduces the NL2VC-60 dataset (60 complex algorithmic problems) and studies how to convert informal natural-language problem statements into precise, Dafny-based formal specifications plus implementation logic (a small illustrative Dafny sketch of such a specification-plus-implementation pair follows this list).
  • Across seven open-weight LLMs, the authors compare tiered prompting methods (contextless, signature-guided, and iterative self-healing using Dafny verifier feedback) and find that contextless prompting performs very poorly.
  • Adding structural “signatures” and Dafny-driven self-healing yields large improvements, including high verification success for Gemma 4-31B (90.91%) and a major jump for GPT-OSS 120B (from 0 to 81.82%) when guided by signature-based feedback.
  • To prevent vacuous proofs (where models pass with trivial specs), the work uses uDebug for functional validation, aiming for higher assurance beyond mere verifier acceptance.
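
To make the target of the natural-language-to-Dafny conversion concrete, here is a minimal sketch of our own (the problem and method names are hypothetical, not drawn from NL2VC-60): the `requires`/`ensures` clauses are the formal specification, and the loop invariants are the kind of annotations the Dafny verifier needs in order to prove the implementation correct.

```dafny
// Illustrative only: a hypothetical "maximum of a non-empty array" task,
// not a problem taken from NL2VC-60.
method MaxElement(a: array<int>) returns (m: int)
  requires a.Length > 0                                // precondition: input is non-empty
  ensures exists i :: 0 <= i < a.Length && a[i] == m   // m occurs in the array
  ensures forall i :: 0 <= i < a.Length ==> a[i] <= m  // m bounds every element
{
  m := a[0];
  var j := 1;
  while j < a.Length
    // Loop invariants are the "complex annotations" the verifier needs
    // to discharge the postconditions automatically.
    invariant 1 <= j <= a.Length
    invariant exists i :: 0 <= i < j && a[i] == m
    invariant forall i :: 0 <= i < j ==> a[i] <= m
  {
    if a[j] > m {
      m := a[j];
    }
    j := j + 1;
  }
}
```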

Abstract

Large Language Models (LLMs) show promise in automated software engineering, yet the correctness of their output is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specifications remains an arduous task. Our work addresses this by providing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset: a collection of 60 complex algorithmic problems. We evaluate 11 randomly selected problem sets across seven open-weight LLMs using a tiered prompting strategy: contextless prompts, signature prompts providing structural anchors, and self-healing prompts utilizing iterative feedback from the Dafny verifier. To address vacuous verification, where models satisfy the verifier with trivial specifications, we integrate the uDebug platform to ensure functional validation. Our results show that while contextless prompting leads to near-universal failure, structural signatures and iterative self-healing produce a dramatic performance turnaround. Specifically, Gemma 4-31B achieved a 90.91% verification success rate, while GPT-OSS 120B rose from zero to 81.82% success with signature-guided feedback. These findings indicate that formal verification is now attainable for open-weight LLMs, which serve as effective apprentices for synthesizing complex annotations and facilitating high-assurance software development.
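
To illustrate the vacuous-verification problem the abstract raises, consider the following hedged Dafny sketch (our own illustration; the methods are hypothetical and not taken from the paper): the first specification is accepted by the verifier yet constrains nothing, while the second is strong enough that verifier acceptance carries real meaning.

```dafny
// Illustrative only; these methods are our sketch, not code from the paper.

// A "vacuous" specification: the verifier accepts the method because the
// postcondition demands nothing, so even a wrong implementation verifies.
method AbsVacuous(x: int) returns (y: int)
  ensures true
{
  y := x;  // wrong for negative x, yet verification still succeeds
}

// A meaningful specification: the postconditions pin down the intended
// behaviour, so a correct implementation is actually required.
method Abs(x: int) returns (y: int)
  ensures y >= 0
  ensures y == x || y == -x
{
  if x < 0 {
    y := -x;
  } else {
    y := x;
  }
}
```

The contrast is the point: verifier acceptance only certifies that the code meets whatever specification was written, and catching the first kind of case is exactly what the paper's uDebug-based functional validation step is meant to do.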