Mitigating Distribution Sharpening in Math RLVR via Distribution-Aligned Hint Synthesis and Backward Hint Annealing

arXiv cs.CL / 4/10/2026


Key Points

  • The paper addresses a key failure mode in math RLVR where hint-based training can “sharpen” the solution distribution and narrow coverage on hard problems, harming large-$k$ performance.
  • It proposes Distribution-Aligned Hint Synthesis (DAHS), which generates verified teacher hints conditioned on student-style responses to reduce teacher–student distribution mismatch.
  • It also introduces Backward Hint Annealing (BHA), which gradually reduces hint exposure across difficulty buckets and applies per-question hint dropout so that no-hint updates remain available throughout RL training.
  • Experiments on AIME24/25/26 using Qwen3-1.7B-Base and Llama-3.2-1B-Instruct under the DAPO RLVR framework show improved pass@1 and pass@2048 for Qwen, while Llama gains are mainly in the large-$k$ setting.
  • The results suggest that effective math RLVR with hints requires early scaffolding to enable learning on challenging questions, followed by systematic reduction of hints before evaluation under no-hint settings.
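The paper's BHA component combines two scheduling mechanisms: hint exposure that anneals across difficulty buckets, and per-question hint dropout so no-hint gradients remain available throughout training. A minimal sketch of how such a schedule could look (the linear schedule, bucket cutoffs, `p_max`, and function names here are illustrative assumptions, not the paper's exact formulation):

```python
import random

def hint_probability(step, total_steps, bucket, n_buckets, p_max=0.9):
    """Illustrative BHA-style schedule (not the paper's exact rule):
    harder buckets (higher index) keep hints longer, and every bucket's
    hint probability anneals to zero before no-hint evaluation."""
    progress = step / total_steps
    # Each bucket stops receiving hints after its cutoff fraction of training;
    # harder buckets get a later cutoff ("backward" annealing).
    cutoff = (bucket + 1) / n_buckets
    if progress >= cutoff:
        return 0.0
    return p_max * (1.0 - progress / cutoff)

def maybe_hint(question, hint, step, total_steps, bucket, n_buckets, rng=random):
    """Per-question hint dropout: with probability p the prompt carries the
    verified teacher hint, otherwise the model trains on the bare question,
    so no-hint updates stay available throughout RL training."""
    p = hint_probability(step, total_steps, bucket, n_buckets)
    if rng.random() < p:
        return f"{question}\n\nHint: {hint}"
    return question
```

Under this sketch, easy-bucket questions lose their hints first while the hardest bucket anneals over the full run, and even early in training a fraction of prompts are hint-free, matching the no-hint evaluation condition.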

Abstract

Reinforcement learning with verifiable rewards (RLVR) can improve low-$k$ reasoning accuracy while narrowing solution coverage on challenging math questions, and pass@1 gains do not necessarily translate into better large-$k$ performance. Existing hint-based approaches can make challenging questions trainable, but they leave two issues underexplored: teacher-student distribution mismatch and the need to reduce hint exposure to match no-hint evaluation. We address these issues through two components. Distribution-Aligned Hint Synthesis (DAHS) constructs verified teacher hints conditioned on student-style responses. Backward Hint Annealing (BHA) anneals hint exposure across difficulty buckets and uses per-question hint dropout to preserve no-hint updates throughout RL training. We evaluate the method in math RLVR under the DAPO training framework across AIME24, AIME25, and AIME26 using `Qwen3-1.7B-Base` and `Llama-3.2-1B-Instruct`. On `Qwen3-1.7B-Base`, our method improves both pass@1 and pass@2048 relative to DAPO across the three AIME benchmarks. On `Llama-3.2-1B-Instruct`, the gains are concentrated in the large-$k$ regime. These results suggest that, in math RLVR, hint scaffolding is effective when it restores learnable updates on challenging questions early in training and is then gradually removed before no-hint evaluation.
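The DAHS component can be read as a verification loop: a teacher drafts a hint conditioned on a student-style response (to keep the hint close to the student's distribution), and the hint is accepted only if the student can reach a verified answer with it. A minimal sketch under that reading; the function names, the retry loop, and the callable interfaces (`teacher`, `student`, `verifier`) are assumptions for illustration, not the paper's API:

```python
def synthesize_aligned_hint(question, reference_answer, student_sample,
                            teacher, student, verifier, max_tries=4):
    """Illustrative DAHS-style loop (assumed interfaces, not the paper's code).

    teacher(question, student_sample) -> hint text written in the student's style
    student(prompt)                   -> a sampled student solution string
    verifier(solution, answer)        -> True iff the solution is verifiably correct
    Returns a verified hint, or None if no hint passes within max_tries.
    """
    for _ in range(max_tries):
        # Condition the teacher on a student-style response to reduce
        # teacher-student distribution mismatch.
        hint = teacher(question, student_sample)
        # Accept the hint only if it actually makes the question solvable
        # for the student under verifiable reward.
        attempt = student(f"{question}\n\nHint: {hint}")
        if verifier(attempt, reference_answer):
            return hint
    return None
```

Only hints that survive this check would enter the RL prompt pool, so hinted questions remain trainable under a verifiable reward rather than relying on unverified teacher output.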