SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution

arXiv cs.AI / 4/23/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The paper identifies a “Mental-Reality Gap” in LLM code generation where models hallucinate execution traces, leading to confident validation of incorrect code.
  • SolidCoder is proposed under the principle "don't imagine -- execute," addressing both specification gaps (missing edge cases) and verification gaps (hallucinating correct behavior for buggy code).
  • The SOLID architecture uses edge-case awareness before algorithm design and replaces imagined traces with sandboxed execution guided by property-based oracles.
  • Experiments with GPT-4o show state-of-the-art results, including 95.7% pass@1 on HumanEval, 77.0% on CodeContests, and 26.7% on APPS, with ablation indicating edge-case awareness is the biggest driver.
  • The approach generalizes to RL post-trained models and the authors release the code/framework to support further research.
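The core mechanism above, replacing imagined execution traces with real sandboxed runs checked by property-based oracles, can be illustrated with a minimal sketch. This is not the paper's implementation: the candidate program, the `run_sandboxed` helper, and the `property_oracle` check are all hypothetical stand-ins, using a subprocess with a timeout as a crude sandbox.

```python
import subprocess
import sys
import textwrap

def run_sandboxed(code: str, stdin: str, timeout: float = 2.0) -> str:
    # Execute candidate code in a separate Python process with a timeout,
    # instead of trusting a mentally simulated trace (hypothetical helper).
    proc = subprocess.run(
        [sys.executable, "-c", code],
        input=stdin,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.stdout.strip()

# Hypothetical LLM-generated candidate: sort space-separated integers.
candidate = textwrap.dedent("""
    nums = list(map(int, input().split()))
    print(" ".join(map(str, sorted(nums))))
""")

def property_oracle(inp: str, out: str) -> bool:
    # Property-based check: the output must be a sorted permutation of
    # the input -- verifiable without a reference solution.
    xs = list(map(int, inp.split()))
    ys = list(map(int, out.split()))
    return ys == sorted(xs)

# Include an edge case (single element) alongside typical inputs.
inputs = ["3 1 2", "5", "9 9 1 0"]
results = [property_oracle(i, run_sandboxed(candidate, i)) for i in inputs]
print(results)  # → [True, True, True]
```

The point of the oracle is that it validates a behavioral property rather than a fixed expected output, so a buggy candidate cannot be "confidently validated" the way a hallucinated trace might allow.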

Abstract

State-of-the-art code generation frameworks rely on mental simulation, where LLMs internally trace execution to verify correctness. We expose a fundamental limitation: the Mental-Reality Gap -- where models hallucinate execution traces and confidently validate buggy code. This gap manifests along two orthogonal dimensions: the Specification Gap (overlooking edge cases during planning) and the Verification Gap (hallucinating correct behavior for flawed code). We propose SolidCoder with a simple principle: don't imagine -- execute. The S.O.L.I.D. architecture addresses both dimensions by forcing edge-case awareness before algorithm design and replacing imagined traces with sandboxed execution using property-based oracles. With GPT-4o, SolidCoder achieves state-of-the-art pass@1 performance: 95.7% on HumanEval (+0.6%p), 77.0% on CodeContests (+4.3%p), and 26.7% on APPS (+3.4%p). Ablation reveals that edge-case awareness provides the largest individual gain, while execution grounding catches categorically different errors that specification improvements cannot address. These gains generalize to RL post-trained models, validating that bridging both gap dimensions is essential for robust code synthesis. We release our code and framework to facilitate future research.