Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1–3B Code Generation
arXiv cs.AI / 4/27/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The study examines whether composing small (1–3B) code-generation language models into pipelines can improve performance, focusing on the role of execution feedback versus pipeline complexity.
- Results on HumanEval (164 problems) and sanitized MBPP (427 problems), run locally on a single laptop, show that self-refinement with execution feedback improves code generation by more than 4 standard deviations on both benchmarks (a minimal version of the loop is sketched after this list).
- The mechanism of improvement is narrow: refinement substantially reduces surface-level failures such as NameError and SyntaxError but seldom fixes logic-level failures such as AssertionError.
- Within the tested model pool, refiner capability matters more than generator identity (e.g., a 1.5B generator with a 3B refiner can match a 3B model handling both roles), and early stopping is critical because additional iterations can become net-negative.
- In a constrained architecture search using NEAT-inspired evolutionary methods, execution feedback matters more than added pipeline topology, and specialized pipeline configurations outperform general-purpose ones. Separately, scoring a pipeline on a single evaluation can inflate results by 5–7%, which argues for averaging fitness over repeated runs (see the second sketch below).
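The self-refinement loop the key points describe is simple to state in code. Below is a minimal sketch, not the paper's implementation: `generate` and `refine` are hypothetical callables standing in for the generator and refiner models, and the test harness is a bare `exec`.

```python
MAX_ITERS = 3  # early-stopping bound; per the findings above, extra iterations can turn net-negative

def run_candidate(code: str, tests: str) -> str | None:
    """Execute a candidate solution plus its tests; return an error string on failure, None on success."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # surface-level failures (SyntaxError, NameError) show up here
        exec(tests, namespace)  # logic-level failures show up here as AssertionError
        return None
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"

def refine_with_feedback(task: str, tests: str, generate, refine) -> str:
    """Generate once, then loop: execute, feed the error back to the refiner, stop on first success."""
    code = generate(task)                 # e.g. a 1.5B generator drafts a solution
    for _ in range(MAX_ITERS):
        error = run_candidate(code, tests)
        if error is None:                 # early stop: passing code gains nothing from more rounds
            return code
        code = refine(task, code, error)  # e.g. a 3B refiner sees the concrete error message
    return code
```

Because the loop captures the exception class name verbatim, tallying those names across a benchmark run reproduces the failure taxonomy above: NameError and SyntaxError counts drop after refinement, while AssertionError counts barely move.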
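The 5–7% inflation figure points at a standard guard: score each candidate pipeline as the mean of several independent evaluations rather than a single run. A minimal sketch, assuming a stochastic `evaluate` callable (hypothetical) that returns a pass rate in [0, 1]:

```python
import statistics

def fitness(pipeline, problems, evaluate, k: int = 3) -> float:
    """Mean pass rate over k independent evaluations of a stochastic pipeline.

    A single evaluation can land on a lucky sampling draw and overstate
    fitness (the 5-7% inflation noted above); averaging k runs gives the
    evolutionary search a less noisy selection signal.
    """
    runs = [evaluate(pipeline, problems) for _ in range(k)]
    return statistics.mean(runs)
```

Here k trades compute for variance reduction; since each evaluation means pushing a full benchmark through a model pipeline, a small k (2–3) is the realistic budget for the single-laptop setup described above.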