ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

arXiv cs.LG / 5/1/2026


Key Points

  • The paper proposes a shift from “learning to answer” to “learning to question,” aiming for a language model to generate verifiable problem specifications, solve them, and use solver feedback for self-improvement without human supervision.
  • It introduces ANCORA, an anchored-curriculum self-play framework that alternates between a Proposer (which synthesizes new specifications) and a Solver (which produces verified solutions). Training is stabilized by three mechanisms: a two-level group-relative update, iterative self-distilled SFT, and a UCB-guided Curriculum DAG that grows only through strictly filtered, novel, Solver-verified specifications.
  • The authors argue these stabilization mechanisms are necessary because sparse verifier feedback can cause Proposer collapse even when rewards are aligned with MLRL.
  • Instantiated in Verus, the method shows a large gain in the Dafny2Verus test-time-training setting: pass@1 rises from 26.6% (SFT baseline) to 81.5% under 0-shot evaluation, beating a PSV self-play baseline by 15.8 points even though PSV uses 1-shot inference.
  • In a transfer setting initialized from Dafny2Verus seeds, the method achieves 36.2% pass@1 on held-out MBPP and 17.2% on HumanEval.
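
The UCB-guided Curriculum DAG mentioned above selects which specification node to expand next, balancing exploitation of productive regions against exploration of rarely-visited ones. The paper does not publish code for this component; the sketch below is a minimal illustration using the standard UCB1 score over a flat dictionary of nodes (the `reward`/`visits` fields, the `c` constant, and the node layout are all assumptions for illustration, not the authors' implementation).

```python
import math

def ucb_score(reward_sum, visits, total_visits, c=1.0):
    # Standard UCB1: mean reward (exploitation) plus a bonus that
    # shrinks as a node accumulates visits (exploration).
    if visits == 0:
        return float("inf")  # always try unvisited nodes first
    return reward_sum / visits + c * math.sqrt(math.log(total_visits) / visits)

def select_curriculum_node(dag):
    # dag: {node_id: {"reward": float, "visits": int}} -- a flattened
    # stand-in for the curriculum DAG's expandable frontier.
    total = sum(n["visits"] for n in dag.values()) or 1
    return max(
        dag,
        key=lambda k: ucb_score(dag[k]["reward"], dag[k]["visits"], total),
    )
```

In ANCORA's setting, a node's reward would come from Solver verification outcomes on specs grown from that node, and only strictly filtered, Solver-verified specifications would be added as new children.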

Abstract

We propose a paradigm shift from learning to answer to learning to question: can a language model generate verifiable problems, solve them, and turn the resulting feedback into self-improvement without human supervision? We introduce ANCORA, an anchored-curriculum framework in which a unified policy alternates between a Proposer that synthesizes novel specifications and a Solver that produces verified solutions. ANCORA rests on three load-bearing mechanisms: a two-level group-relative update that couples Proposer advantages across specifications with Solver advantages across solution attempts; iterative self-distilled SFT that projects the base model onto its valid-output manifold before RL; and a UCB-guided Curriculum DAG that grows only through strictly filtered, novel, Solver-verified specifications. These stabilizers are necessary because sparse verifier feedback otherwise drives Proposer collapse even under MLRL-aligned rewards. Instantiated in Verus, ANCORA lifts Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5% in the test-time-training setting under 0-shot evaluation, outperforming the PSV self-play baseline by 15.8 points despite PSV using 1-shot inference; in a separate transfer setting, training from Dafny2Verus seeds yields 36.2% and 17.2% pass@1 on held-out MBPP and HumanEval.
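
The "two-level group-relative update" in the abstract couples Proposer advantages (computed across a batch of specifications) with Solver advantages (computed across solution attempts on each specification). The sketch below shows one plausible reading of that coupling, using a simple mean-baseline group-relative advantage at both levels; the function names, the use of mean Solver reward as the Proposer's reward, and the omission of std-normalization are assumptions, not the paper's exact formulation.

```python
def group_relative_advantages(rewards):
    # Subtract the group mean so advantages sum to zero within the group
    # (a GRPO-style baseline; dividing by the group std is a common variant).
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def two_level_advantages(solver_rewards_per_spec):
    # solver_rewards_per_spec: one inner list per proposed specification,
    # holding per-attempt Solver rewards (e.g. 1.0 if verified, else 0.0).
    # Inner level: each Solver attempt is compared against the other
    # attempts on the *same* specification.
    solver_adv = [group_relative_advantages(rs) for rs in solver_rewards_per_spec]
    # Outer level: each specification's mean Solver reward serves as the
    # Proposer's reward, compared against the other specs in the batch.
    spec_rewards = [sum(rs) / len(rs) for rs in solver_rewards_per_spec]
    proposer_adv = group_relative_advantages(spec_rewards)
    return proposer_adv, solver_adv
```

Under this reading, a spec on which every attempt fails (or every attempt succeeds) yields zero Solver advantage, which is one way sparse verifier feedback can starve the update and drive the Proposer collapse the authors describe.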