Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification

arXiv cs.CL / 4/21/2026


Key Points

  • The paper proposes a self-play training framework for LLMs that targets semantic equivalence of Haskell code, using formal verification to steer adversarial generator–evaluator learning.
  • It validates equivalence with Liquid Haskell proofs and obtains execution-based counterexamples when programs are not equivalent, with a difficulty-aware curriculum to control training progression.
  • The authors release OpInstruct-HSx, a synthetic dataset of ~28k validated Haskell programs, to support training and benchmarking.
  • Experiments report strong downstream transfer, including up to a 13.3 percentage-point accuracy gain on EquiBench and consistent improvements on PySecDB, with ablations suggesting that equivalence proofs, rather than inequivalence counterexamples, drive the reasoning gains.
  • The full training pipeline and dataset are published on GitHub and Hugging Face, enabling replication and further research.
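The equivalence-proof side of the loop can be pictured with a minimal Liquid Haskell sketch. The functions below are illustrative, not from the paper: two syntactically different but semantically equal programs are reflected into the refinement logic, and an extrinsic lemma asserting their equality must be verified before an "equivalent" verdict is accepted.

```haskell
{-@ LIQUID "--reflection" @-}
{-@ LIQUID "--ple"        @-}
module DoubleEquiv where

import Language.Haskell.Liquid.ProofCombinators

-- Two candidate programs: syntactically different, semantically equal.
{-@ reflect doubleAdd @-}
doubleAdd :: Int -> Int
doubleAdd x = x + x

{-@ reflect doubleMul @-}
doubleMul :: Int -> Int
doubleMul x = 2 * x

-- Equivalence lemma: the claim "doubleAdd x == doubleMul x" for all x
-- is only accepted if Liquid Haskell discharges this proof.
{-@ doubleEquiv :: x:Int -> { doubleAdd x == doubleMul x } @-}
doubleEquiv :: Int -> Proof
doubleEquiv x
  =   doubleAdd x
  === x + x
  === 2 * x
  === doubleMul x
  *** QED
```

Real generator outputs would involve recursion and inductive lemmas, but the shape of the verification signal is the same: a machine-checked proof object rather than a model's unverified judgment.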

Abstract

We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validating equivalence and execution-based counterexamples for inequivalence, organized via a difficulty-aware curriculum. To facilitate this, we release **OpInstruct-HSx**, a synthetic dataset of ~28k validated Haskell programs. Experiments show that our evaluator transfers effectively to downstream tasks, achieving up to a 13.3pp accuracy gain on EquiBench and consistent gains on PySecDB. Ablation studies on the SEQ-SINQ regimes indicate that while inequivalence supervision provides data volume, equivalence proofs are uniquely responsible for the model's reasoning capabilities. The entire training pipeline and dataset are publicly released on GitHub and Hugging Face, respectively.
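The inequivalence side of the loop rests on execution-based counterexamples: when two programs are claimed to differ, the claim is backed by a concrete input on which they disagree. A rough, self-contained sketch of that check (hypothetical candidate programs and helper names, not the paper's code) is a bounded search over a finite input pool:

```haskell
module Counterexample where

-- Hypothetical candidate pair: list reversal via foldl, and a buggy
-- variant that discards the head element.
progA, progB :: [Int] -> [Int]
progA = foldl (flip (:)) []
progB []       = []
progB (_ : xs) = reverse xs  -- bug: drops the first element

-- Bounded search: return the first input on which the two programs
-- produce different outputs, if any such input exists in the pool.
findCounterexample :: Eq b => (a -> b) -> (a -> b) -> [a] -> Maybe a
findCounterexample f g = foldr pick Nothing
  where
    pick x rest = if f x /= g x then Just x else rest

main :: IO ()
main = print (findCounterexample progA progB inputs)
  where
    -- Small pool of list inputs of increasing length.
    inputs = [take n [1 ..] | n <- [0 .. 5]]
```

Here the search reports `Just [1]`, since both programs agree on the empty list but diverge on the singleton. A witness like this is cheap to re-check by execution, which is what makes inequivalence labels easy to validate at scale compared to equivalence, where a full proof is required.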