ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

arXiv cs.LG / 5/1/2026


Key Points

  • The paper proposes a shift from “learning to answer” to “learning to question,” aiming for a language model to generate verifiable problem specifications, solve them, and use solver feedback for self-improvement without human supervision.
  • It introduces ANCORA, an anchored-curriculum self-play framework that alternates between a Proposer (which synthesizes new specifications) and a Solver (which produces verified solutions). Training is stabilized by three mechanisms: a two-level group-relative update, iterative self-distilled SFT, and a UCB-guided Curriculum DAG that grows only through strictly filtered, novel, Solver-verified specifications.
  • The authors argue these stabilization mechanisms are necessary because sparse verifier feedback can cause Proposer collapse even when rewards are aligned with MLRL.
  • Instantiated in Verus, the method shows a large gain in the Dafny2Verus test-time-training setting: pass@1 rises from 26.6% (SFT baseline) to 81.5% under 0-shot evaluation, beating a PSV self-play baseline by 15.8 points even though PSV uses 1-shot inference.
  • In a transfer setting initialized from Dafny2Verus seeds, the method achieves 36.2% pass@1 on held-out MBPP and 17.2% on HumanEval.
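
The UCB-guided Curriculum DAG mentioned above selects which specification node to expand next, balancing exploitation of productive regions against exploration of rarely-visited ones. The paper does not publish code for this component; the sketch below is a minimal illustration using the standard UCB1 score over a flat dictionary of nodes (the `reward`/`visits` fields, the `c` constant, and the node layout are all assumptions for illustration, not the authors' implementation).

```python
import math

def ucb_score(reward_sum, visits, total_visits, c=1.0):
    # Standard UCB1: mean reward (exploitation) plus a bonus that
    # shrinks as a node accumulates visits (exploration).
    if visits == 0:
        return float("inf")  # always try unvisited nodes first
    return reward_sum / visits + c * math.sqrt(math.log(total_visits) / visits)

def select_curriculum_node(dag):
    # dag: {node_id: {"reward": float, "visits": int}} -- a flattened
    # stand-in for the curriculum DAG's expandable frontier.
    total = sum(n["visits"] for n in dag.values()) or 1
    return max(
        dag,
        key=lambda k: ucb_score(dag[k]["reward"], dag[k]["visits"], total),
    )
```

In ANCORA's setting, a node's reward would come from Solver verification outcomes on specs grown from that node, and only strictly filtered, Solver-verified specifications would be added as new children.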

Abstract

We propose a paradigm shift from learning to answer to learning to question: can a language model generate verifiable problems, solve them, and turn the resulting feedback into self-improvement without human supervision? We introduce ANCORA, an anchored-curriculum framework in which a unified policy alternates between a Proposer that synthesizes novel specifications and a Solver that produces verified solutions. ANCORA rests on three load-bearing mechanisms: a two-level group-relative update that couples Proposer advantages across specifications with Solver advantages across solution attempts; iterative self-distilled SFT that projects the base model onto its valid-output manifold before RL; and a UCB-guided Curriculum DAG that grows only through strictly filtered, novel, Solver-verified specifications. These stabilizers are necessary because sparse verifier feedback otherwise drives Proposer collapse even under MLRL-aligned rewards. Instantiated in Verus, ANCORA lifts Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5% in the test-time-training setting under 0-shot evaluation, outperforming the PSV self-play baseline by 15.8 points despite PSV using 1-shot inference; in a separate transfer setting, training from Dafny2Verus seeds yields 36.2% and 17.2% pass@1 on held-out MBPP and HumanEval.
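
The "two-level group-relative update" in the abstract couples Proposer advantages (computed across a batch of specifications) with Solver advantages (computed across solution attempts on each specification). The sketch below shows one plausible reading of that coupling, using a simple mean-baseline group-relative advantage at both levels; the function names, the use of mean Solver reward as the Proposer's reward, and the omission of std-normalization are assumptions, not the paper's exact formulation.

```python
def group_relative_advantages(rewards):
    # Subtract the group mean so advantages sum to zero within the group
    # (a GRPO-style baseline; dividing by the group std is a common variant).
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def two_level_advantages(solver_rewards_per_spec):
    # solver_rewards_per_spec: one inner list per proposed specification,
    # holding per-attempt Solver rewards (e.g. 1.0 if verified, else 0.0).
    # Inner level: each Solver attempt is compared against the other
    # attempts on the *same* specification.
    solver_adv = [group_relative_advantages(rs) for rs in solver_rewards_per_spec]
    # Outer level: each specification's mean Solver reward serves as the
    # Proposer's reward, compared against the other specs in the batch.
    spec_rewards = [sum(rs) / len(rs) for rs in solver_rewards_per_spec]
    proposer_adv = group_relative_advantages(spec_rewards)
    return proposer_adv, solver_adv
```

Under this reading, a spec on which every attempt fails (or every attempt succeeds) yields zero Solver advantage, which is one way sparse verifier feedback can starve the update and drive the Proposer collapse the authors describe.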