Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

arXiv cs.CL / 4/7/2026


Key Points

  • The paper studies co-evolutionary self-play for LLM curriculum learning, in which one model proposes math problems and another solves them, and shows that training can suffer from “diversity collapse”: the proposer converges to a narrow, reward-satisfying problem distribution.
  • It introduces “vocabulary dropout”: a hard, non-stationary random mask applied to the proposer’s output logits during both policy training and curriculum generation, which keeps the proposer from locking into fixed token sequences.
  • Experiments training Qwen3-4B and Qwen3-8B via R-Zero on mathematical reasoning indicate that vocabulary dropout preserves proposer diversity across lexical, semantic, and functional metrics throughout training.
  • The approach improves the solver by an average of +4.4 points for the 8B model, with the largest gains on competition-level benchmarks.
  • The authors argue that adding explicit action-space constraints—analogous to game rules in classical self-play—can sustain productive co-evolution and make the curriculum informative for the solver.
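The masking mechanism described in the second bullet can be sketched as follows. This is a minimal illustration, not the paper's implementation: the drop rate, the masking granularity (a fresh mask per call, e.g. per generation step or per sampled problem), and the protected-token handling are all assumptions made here for the sketch.

```python
import numpy as np

def vocabulary_dropout_logits(logits, drop_rate=0.2, rng=None, protected_ids=()):
    """Apply a hard, non-stationary vocabulary mask to proposer logits.

    A fresh random subset of the vocabulary is masked to -inf on every call,
    so the proposer cannot lock into fixed token sequences. `drop_rate` and
    the protected-token handling are illustrative assumptions, not values
    taken from the paper.
    """
    rng = rng or np.random.default_rng()
    vocab_size = logits.shape[-1]
    # Hard mask: dropped tokens get probability exactly zero after softmax.
    mask = rng.random(vocab_size) < drop_rate
    # Keep special tokens (e.g., EOS) unmasked so generation can terminate.
    mask[list(protected_ids)] = False
    masked = logits.copy()
    masked[..., mask] = -np.inf
    return masked

# Demo: a toy 10-token vocabulary; a new mask is drawn on each call,
# which is what makes the constraint non-stationary across training.
logits = np.log(np.arange(1, 11, dtype=float))
out = vocabulary_dropout_logits(logits, drop_rate=0.3, rng=np.random.default_rng(0))
probs = np.exp(out - out.max())
probs /= probs.sum()
```

Because the mask is hard (masked logits are exactly -inf), the dropped tokens receive zero probability rather than merely being down-weighted, and because it is resampled each time, no fixed token sequence stays reachable across the whole curriculum.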

Abstract

Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training, and yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.
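The abstract reports diversity across lexical, semantic, and functional metrics without defining them here. As a rough illustration of the lexical family, a distinct-n style measure over the proposer's generated problems would flag the collapse described above; this is a hypothetical stand-in for whatever metric the paper actually uses.

```python
def distinct_n(texts, n=2):
    """Fraction of unique n-grams across a set of generated problems.

    A common lexical-diversity proxy (distinct-n): 1.0 means every n-gram
    is unique; values near 0 indicate the proposer is repeating itself.
    Illustrative only -- not necessarily the paper's metric.
    """
    ngrams = []
    for t in texts:
        toks = t.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

# A collapsed proposer repeats near-identical problems, so its distinct-n
# score drops well below that of a diverse batch.
diverse = distinct_n(["solve for x", "prove the bound"], n=2)
collapsed = distinct_n(["solve for x", "solve for x"], n=2)
```

Tracking such a score over training steps is one cheap way to detect diversity collapse before the solver's learning curve stalls.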