SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution

arXiv cs.AI / 4/23/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The paper identifies a “Mental-Reality Gap” in LLM code generation where models hallucinate execution traces, leading to confident validation of incorrect code.
  • SolidCoder is proposed under the principle "don't imagine -- execute," addressing both specification gaps (missing edge cases) and verification gaps (hallucinating correct behavior for buggy code).
  • The SOLID architecture uses edge-case awareness before algorithm design and replaces imagined traces with sandboxed execution guided by property-based oracles.
  • Experiments with GPT-4o show state-of-the-art results, including 95.7% pass@1 on HumanEval, 77.0% on CodeContests, and 26.7% on APPS, with ablation indicating edge-case awareness is the biggest driver.
  • The approach generalizes to RL post-trained models and the authors release the code/framework to support further research.
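The core mechanism above, replacing imagined execution traces with real sandboxed runs checked by property-based oracles, can be illustrated with a minimal sketch. This is not the paper's implementation: the candidate program, the `run_sandboxed` helper, and the `property_oracle` check are all hypothetical stand-ins, using a subprocess with a timeout as a crude sandbox.

```python
import subprocess
import sys
import textwrap

def run_sandboxed(code: str, stdin: str, timeout: float = 2.0) -> str:
    # Execute candidate code in a separate Python process with a timeout,
    # instead of trusting a mentally simulated trace (hypothetical helper).
    proc = subprocess.run(
        [sys.executable, "-c", code],
        input=stdin,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.stdout.strip()

# Hypothetical LLM-generated candidate: sort space-separated integers.
candidate = textwrap.dedent("""
    nums = list(map(int, input().split()))
    print(" ".join(map(str, sorted(nums))))
""")

def property_oracle(inp: str, out: str) -> bool:
    # Property-based check: the output must be a sorted permutation of
    # the input -- verifiable without a reference solution.
    xs = list(map(int, inp.split()))
    ys = list(map(int, out.split()))
    return ys == sorted(xs)

# Include an edge case (single element) alongside typical inputs.
inputs = ["3 1 2", "5", "9 9 1 0"]
results = [property_oracle(i, run_sandboxed(candidate, i)) for i in inputs]
print(results)  # → [True, True, True]
```

The point of the oracle is that it validates a behavioral property rather than a fixed expected output, so a buggy candidate cannot be "confidently validated" the way a hallucinated trace might allow.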

Abstract

State-of-the-art code generation frameworks rely on mental simulation, where LLMs internally trace execution to verify correctness. We expose a fundamental limitation: the Mental-Reality Gap -- where models hallucinate execution traces and confidently validate buggy code. This gap manifests along two orthogonal dimensions: the Specification Gap (overlooking edge cases during planning) and the Verification Gap (hallucinating correct behavior for flawed code). We propose SolidCoder with a simple principle: don't imagine -- execute. The S.O.L.I.D. architecture addresses both dimensions by forcing edge-case awareness before algorithm design and replacing imagined traces with sandboxed execution using property-based oracles. With GPT-4o, SolidCoder achieves state-of-the-art pass@1 performance: 95.7% on HumanEval (+0.6%p), 77.0% on CodeContests (+4.3%p), and 26.7% on APPS (+3.4%p). Ablation reveals that edge-case awareness provides the largest individual gain, while execution grounding catches categorically different errors that specification improvements cannot address. These gains generalize to RL post-trained models, validating that bridging both gap dimensions is essential for robust code synthesis. We release our code and framework to facilitate future research.