AI Navigate

GASP: Guided Asymmetric Self-Play For Coding LLMs

arXiv cs.LG / 3/18/2026

Key Points

  • The paper introduces Guided Asymmetric Self-Play (GASP), which grounds self-play for coding LLMs in real-data goalpost questions: questions identified as hard exploration challenges for the model, used to steer exploration.
  • During self-play, the teacher first generates an easier variant of a hard goalpost question, then a harder variant of that easier question, gradually closing the gap to the goalpost over the course of training.
  • Compared with unguided asymmetric self-play, GASP improves pass@20 on LiveCodeBench (LCB) by 2.5% and solves hard goalpost questions that remain out of reach for all baselines.
  • By grounding the curriculum in real tasks rather than in difficulty alone, the approach addresses a weakness of prior asymmetric self-play: many problems that are hard to solve are neither interesting nor informative for improving the model.
  • The paper suggests that such grounded curricula can lead to more efficient post-training data generation for coding LLMs and better handling of hard problem distributions.
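
The two-step teacher move described in the key points (easier variant of the goalpost, then a harder variant of that easier question) can be illustrated with a toy sketch. Everything here is a hypothetical simplification and not the paper's implementation: questions are reduced to a single numeric difficulty, and `ToyStudent`, `easier_variant`, `harder_variant`, `guided_self_play`, `reach`, and `step` are invented names. In the paper, both teacher and student are LLMs working on coding problems.

```python
from dataclasses import dataclass


@dataclass
class ToyStudent:
    """Hypothetical stand-in for the student model."""
    skill: float  # hardest difficulty solved so far
    reach: float  # how far beyond `skill` the student can still succeed

    def attempt(self, difficulty: float) -> bool:
        # The student only succeeds near the edge of its current ability,
        # and solving a question raises its skill to that difficulty.
        solved = difficulty <= self.skill + self.reach
        if solved:
            self.skill = max(self.skill, difficulty)
        return solved


def easier_variant(goalpost: float, student: ToyStudent) -> float:
    """Teacher step 1 (toy): simplify the goalpost to the student's level."""
    return min(goalpost, student.skill)


def harder_variant(easier: float, goalpost: float, step: float) -> float:
    """Teacher step 2 (toy): harden the easier question, capped at the goalpost."""
    return min(goalpost, easier + step)


def guided_self_play(goalpost: float, student: ToyStudent,
                     step: float, max_rounds: int = 100) -> list:
    """Run rounds until the student solves the goalpost; return the curriculum."""
    curriculum = []
    for _ in range(max_rounds):
        if student.attempt(goalpost):
            break  # the gap to the goalpost is closed
        easy = easier_variant(goalpost, student)
        hard = harder_variant(easy, goalpost, step)
        student.attempt(easy)
        student.attempt(hard)
        curriculum.append((easy, hard))
    return curriculum


student = ToyStudent(skill=1.0, reach=1.0)
print(guided_self_play(goalpost=5.0, student=student, step=1.0))
# prints [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
```

In this toy setting the student cannot solve the goalpost (difficulty 5.0) directly, but each (easier, harder) pair moves its skill one step closer until the goalpost falls within reach. The point of the sketch is only the anchoring: every generated question is derived from the goalpost, so difficulty increases toward a real task instead of in an arbitrary direction.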

Abstract

Asymmetric self-play has emerged as a promising paradigm for post-training large language models, where a teacher continually generates questions for a student to solve at the edge of the student's learnability. Although these methods promise open-ended data generation bootstrapped from no human data, they suffer from one major problem: not all problems that are hard to solve are interesting or informative to improve the overall capabilities of the model. Current asymmetric self-play methods are goal-agnostic with no real grounding. We propose Guided Asymmetric Self-Play (GASP), where grounding is provided by real-data goalpost questions that are identified to pose a hard exploration challenge to the model. During self-play, the teacher first generates an easier variant of a hard question, and then a harder variant of that easier question, with the goal of gradually closing the gap to the goalpost throughout training. Doing so, we improve pass@20 on LiveCodeBench (LCB) by 2.5% over unguided asymmetric self-play, and through the curriculum constructed by the teacher, we manage to solve hard goalpost questions that remain out of reach for all baselines.
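
The abstract reports results in terms of pass@20 without defining the metric. As a minimal sketch, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is the number that pass all tests. Whether the GASP paper uses exactly this estimator, and how many samples it draws per problem, are assumptions here; the function names below are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of n generated samples passed."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list, k: int) -> float:
    """Average pass@k over problems; `results` holds (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem counts (n samples, c correct), not data from the paper.
example = [(40, 0), (40, 1), (40, 40)]
print(mean_pass_at_k(example, k=20))
```

With k = 20, a problem counts as (probabilistically) solved if any of 20 attempts succeeds, which is why the metric is sensitive to rare successes on hard questions such as the goalposts.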