AI Navigate

Execution-Grounded Credit Assignment for GRPO in Code Generation

arXiv cs.LG / 3/18/2026


Key Points

  • The authors address coarse credit assignment in critic-free RL for code generation: a single outcome reward is spread uniformly across a long program even when the failure stems from a localized semantic error.
  • They propose Execution-Grounded Credit Assignment (EGCA), which uses execution traces to localize GRPO updates to the token span corresponding to the earliest semantic divergence.
  • EGCA runs the candidate and a canonical reference solution under identical instrumentation, pinpoints where execution first diverges, and masks downstream tokens so credit targets only the faulty span.
  • It is a drop-in modification requiring no critic, auxiliary loss, or learned verifier, and lifts pass@1 to 82.1% on HumanEval (+3.1 over GRPO) and 68.9% on MBPP (+1.5) at roughly 18% wall-clock overhead.
  • The approach suggests a general method to improve RL-based code generation by grounding credit in execution traces rather than global outcomes.
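The paper's instrumentation is not published; as a minimal sketch, "identical instrumentation" of candidate and reference can be approximated with Python's `sys.settrace`, recording per-line variable snapshots and scanning for the first step where shared variables disagree. The helpers `record_trace` and `earliest_divergence`, and the toy `ref_sum`/`cand_sum` pair, are hypothetical illustrations, not the authors' code:

```python
import sys
from typing import Any, Callable


def record_trace(fn: Callable[..., Any], *args: Any) -> list[tuple[int, dict[str, Any]]]:
    """Run fn under line-level tracing, snapshotting (relative line, locals) per step."""
    trace: list[tuple[int, dict[str, Any]]] = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            rel_line = frame.f_lineno - fn.__code__.co_firstlineno
            trace.append((rel_line, dict(frame.f_locals)))  # shallow snapshot
        return tracer

    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return trace


def earliest_divergence(cand_trace, ref_trace):
    """Index of the first step where shared local variables differ, else None."""
    for i, ((_, c_vars), (_, r_vars)) in enumerate(zip(cand_trace, ref_trace)):
        shared = set(c_vars) & set(r_vars)
        if any(c_vars[v] != r_vars[v] for v in shared):
            return i
    return None


# Toy example: a correct reference and a candidate with a localized bug.
def ref_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total


def cand_sum(xs):
    total = 0
    for x in xs:
        total += x + 1  # bug: off-by-one per element
    return total


t_ref = record_trace(ref_sum, [1, 2, 3])
t_cand = record_trace(cand_sum, [1, 2, 3])
idx = earliest_divergence(t_cand, t_ref)  # first step where `total` differs
```

Mapping the divergent step back to the token span that generated the offending source line is the part EGCA adds on top; this sketch only shows the trace-comparison side.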

Abstract

Critic-free reinforcement learning with verifiable rewards (RLVR) improves code generation by optimizing unit-test pass rates, but GRPO-style updates suffer from coarse credit assignment: a single outcome signal is spread uniformly across long programs even when failure stems from a localized semantic error. We propose Execution-Grounded Credit Assignment (EGCA), which localizes GRPO updates using execution traces. For programs that satisfy algorithmic constraints but fail tests, EGCA executes the candidate and a canonical reference solution (curated once offline; used for analysis, not supervision) under identical instrumentation, identifies the earliest semantic divergence, and assigns advantage only to the corresponding token span while masking downstream tokens. EGCA is a drop-in modification requiring no critic, auxiliary loss, or learned verifier, yielding 82.1% pass@1 on HumanEval (+3.1 over GRPO) and 68.9% on MBPP (+1.5) with 18% wall-clock overhead.
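To make the credit-assignment change concrete, here is a simplified sketch of how span-localized advantages could be built on top of GRPO's group-normalized baseline. It omits the policy ratio, clipping, and KL terms, and the function name and span format are assumptions for illustration; samples with no localized fault fall back to plain GRPO's uniform spreading:

```python
import statistics


def masked_grpo_advantages(rewards, seq_len, fault_spans):
    """Per-token advantages: group-normalized reward, credited only to the
    faulty token span when one is localized (downstream tokens stay zero).

    rewards:     one scalar outcome per sampled program in the group
    fault_spans: per sample, (start, end) token span of the earliest
                 divergence, or None for uniform credit (plain GRPO)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    adv = [(r - mean) / (std + 1e-6) for r in rewards]  # group baseline

    per_token = []
    for a, span in zip(adv, fault_spans):
        if span is None:
            row = [a] * seq_len           # uniform credit across the program
        else:
            s, e = span
            row = [0.0] * seq_len
            for t in range(s, e):
                row[t] = a                # credit only the divergent span
            # tokens at index >= e remain masked at zero
        per_token.append(row)
    return per_token


# Group of two samples: one passes (reward 1), one fails with a fault
# localized to tokens [1, 3).
advs = masked_grpo_advantages([1.0, 0.0], seq_len=5,
                              fault_spans=[None, (1, 3)])
```

In this toy group the failing sample gets negative advantage only on tokens 1 and 2, while tokens outside the span contribute nothing to the update.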