A Deep Dive into Scaling RL for Code Generation with Synthetic Data and Curricula

arXiv cs.LG, March 26, 2026


Key Points

  • The paper explores scaling reinforcement learning (RL) for code generation and argues that, at scale, performance is limited more by data diversity and structure than by raw data volume.
  • It introduces a scalable multi-turn synthetic data generation pipeline where a “teacher” model iteratively refines tasks using in-context summaries of a student model’s performance, without teacher fine-tuning.
  • Compared with single-turn generation, the multi-turn approach yields more valid synthetic problems and creates structured difficulty progressions (“stepping stones”) that enable curriculum-based RL training.
  • Experiments across Llama3.1-8B Instruct and Qwen3-8B Base (and additional runs with Qwen2.5-32B) analyze how task difficulty, curriculum scheduling, and environment diversity jointly affect RL training dynamics.
  • Results indicate synthetic augmentation improves in-domain code performance and, in most cases, boosts out-of-domain math performance, with empirical guidance on curriculum and diversity design.
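The multi-turn pipeline in the key points above can be sketched as a simple loop: each turn, the student's pass rates on the current task pool are summarized, and a frozen teacher is prompted (in context, with no fine-tuning) to emit easier and harder variants of each task. The sketch below illustrates the control flow only; `Task`, `teacher_refine`, `student_solve`, and the 1–5 difficulty scale are all hypothetical stand-ins, not the paper's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    difficulty: int  # hypothetical 1-5 scale

def teacher_refine(task, summary):
    """Stand-in for the frozen teacher LLM: given an in-context summary of
    student performance, emit easier and harder variants of the same task
    (the "stepping stones"). A real pipeline would prompt a teacher model."""
    return [
        Task(task.prompt + " [simplified]", max(1, task.difficulty - 1)),
        Task(task.prompt + " [extended]", min(5, task.difficulty + 1)),
    ]

def student_solve(task):
    """Stand-in for the student policy: returns a pass rate in [0, 1].
    Here, harder tasks simply get lower pass rates."""
    return max(0.0, 1.0 - 0.2 * task.difficulty)

def generate_stepping_stones(seed_tasks, turns=3):
    """Multi-turn loop: each turn, summarize student performance on the
    current pool and let the teacher refine every task in context."""
    pool = list(seed_tasks)
    for _ in range(turns):
        summary = {t.prompt: student_solve(t) for t in pool}
        refined = []
        for task in pool:
            refined.extend(teacher_refine(task, summary))
        pool.extend(refined)
    return pool
```

Because each turn conditions the teacher on fresh performance summaries, the pool accumulates graded variants of each seed task rather than isolated one-shot generations, which is what makes curriculum construction possible downstream.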

Abstract

Reinforcement learning (RL) has emerged as a powerful paradigm for improving large language models beyond supervised fine-tuning, yet sustaining performance gains at scale remains an open challenge, as data diversity and structure, rather than volume alone, become the limiting factor. We address this by introducing a scalable multi-turn synthetic data generation pipeline in which a teacher model iteratively refines problems based on in-context student performance summaries, producing structured difficulty progressions without any teacher fine-tuning. Compared to single-turn generation, this multi-turn approach substantially improves the yield of valid synthetic problems and naturally produces stepping stones, i.e. easier and harder variants of the same core task, that support curriculum-based training. We systematically study how task difficulty, curriculum scheduling, and environment diversity interact during RL training across the Llama3.1-8B Instruct and Qwen3-8B Base model families, with additional scaling experiments on Qwen2.5-32B. Our results show that synthetic augmentation consistently improves in-domain code and in most cases out-of-domain math performance, and we provide empirical insights into how curriculum design and data diversity jointly shape RL training dynamics.
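One way to picture the curriculum scheduling the abstract studies: order the generated tasks easiest-first and split them into training stages, so RL sees a structured difficulty progression. This is a minimal sketch under assumed inputs (dicts with a `"difficulty"` field), not the paper's scheduler.

```python
def curriculum_schedule(tasks, stages=3):
    """Order tasks by ascending difficulty and split them into roughly
    equal-sized training stages (easiest stage first)."""
    ordered = sorted(tasks, key=lambda t: t["difficulty"])
    base, extra = divmod(len(ordered), stages)
    schedule, i = [], 0
    for s in range(stages):
        size = base + (1 if s < extra else 0)  # spread the remainder
        schedule.append(ordered[i:i + size])
        i += size
    return schedule
```

A difficulty-sorted split like this is only one scheduling choice; the paper's contribution is precisely the empirical study of how such scheduling interacts with task difficulty and environment diversity during RL training.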