Self-Execution Simulation Improves Coding Models

arXiv cs.CL / 4/7/2026

Key Points

The paper proposes training coding LLMs to estimate and simulate program execution step-by-step to address failures in predicting how generated code will run.
It combines supervised fine-tuning on natural language execution traces (textual explanations grounded in true execution) with reinforcement learning that uses verifiable rewards.
The method uses two objectives: predicting outputs from code and inputs, and solving competitive programming problems using either ground-truth or self-predicted execution feedback.
By simulating execution, the model can self-verify across multiple candidate solutions and iteratively self-fix through test execution loops.
Experiments on multiple competitive programming benchmarks show consistent gains versus standard reasoning approaches, alongside ablations highlighting both the benefits and limitations of execution simulation.
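The self-verification idea in the key points can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Simulator` type, `real_execution_simulator`, `select_candidate`, and the convention that each candidate defines a `solve` function are all hypothetical names chosen for illustration. In the paper the model itself predicts the output; here real execution stands in for that prediction so the example runs on its own.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: (candidate source code, test input) -> predicted output.
Simulator = Callable[[str, str], str]


def real_execution_simulator(code: str, test_input: str) -> str:
    """Stand-in for model-based simulation: actually run the candidate.

    Assumes (for illustration only) that the candidate defines solve(s) -> str.
    """
    namespace: dict = {}
    exec(code, namespace)
    return str(namespace["solve"](test_input))


def select_candidate(
    candidates: Sequence[str],
    tests: Sequence[Tuple[str, str]],
    simulate: Simulator,
) -> int:
    """Return the index of the candidate whose predicted outputs match the most tests."""
    scores: List[int] = []
    for code in candidates:
        passed = 0
        for test_input, expected in tests:
            try:
                if simulate(code, test_input).strip() == expected.strip():
                    passed += 1
            except Exception:
                # A candidate that crashes under simulation simply fails this test.
                pass
        scores.append(passed)
    # Ties resolve to the earliest candidate.
    return max(range(len(candidates)), key=scores.__getitem__)


candidates = [
    "def solve(s):\n    return str(int(s) + 1)",
    "def solve(s):\n    return str(int(s) * 2)",
]
tests = [("3", "6"), ("5", "10")]
best = select_candidate(candidates, tests, real_execution_simulator)
```

Swapping `real_execution_simulator` for a function that queries a model would give the self-predicted variant; the selection logic itself would not need to change.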

Abstract

A promising research direction in enabling LLMs to generate consistently correct code involves addressing their inability to properly estimate program execution, particularly for code they generate. In this work, we demonstrate that Code LLMs can be trained to simulate program execution in a step-by-step manner and that this capability can be leveraged to improve competitive programming performance. Our approach combines supervised fine-tuning on natural language execution traces (textual explanations grounded in true execution) with reinforcement learning using verifiable rewards. We introduce two complementary objectives: output prediction given code and inputs, and solving competitive programming tasks with either ground-truth or self-predicted execution feedback. These objectives enable models to perform self-verification over multiple candidate solutions and iterative self-fixing by simulating test execution. Across multiple competitive programming benchmarks, our method yields consistent improvements over standard reasoning approaches. We further present ablations and analysis to elucidate the role of execution simulation and its limitations.
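As a rough illustration of the output-prediction objective and the iterative self-fixing loop described in the abstract, the following Python sketch may help. The abstract does not specify the reward function or the loop structure, so the binary whitespace-normalized match in `output_prediction_reward`, the `max_rounds` cutoff, and the `simulate` and `revise` callables are all assumptions made for illustration; in practice both callables would be backed by the model.

```python
from typing import Callable, List, Sequence, Tuple

Test = Tuple[str, str]            # (input, expected output)
Failure = Tuple[str, str, str]    # (input, expected output, predicted output)


def output_prediction_reward(predicted: str, actual: str) -> float:
    """Assumed binary verifiable reward: 1.0 on a whitespace-normalized exact match."""
    return 1.0 if predicted.split() == actual.split() else 0.0


def failing_tests(
    code: str, tests: Sequence[Test], simulate: Callable[[str, str], str]
) -> List[Failure]:
    """Collect the tests whose simulated output disagrees with the expected output."""
    failures: List[Failure] = []
    for test_input, expected in tests:
        predicted = simulate(code, test_input)
        if output_prediction_reward(predicted, expected) == 0.0:
            failures.append((test_input, expected, predicted))
    return failures


def self_fix(
    code: str,
    tests: Sequence[Test],
    simulate: Callable[[str, str], str],
    revise: Callable[[str, List[Failure]], str],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Revise the code until simulation reports no failures or rounds run out.

    Returns the final code and whether it passes all tests under simulation.
    """
    for _ in range(max_rounds):
        failures = failing_tests(code, tests, simulate)
        if not failures:
            return code, True
        code = revise(code, failures)
    return code, not failing_tests(code, tests, simulate)
```

Because the loop only sees simulated outputs, a wrong simulation can accept a broken program or reject a correct one, which is one way the limitations mentioned in the abstract could show up.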