Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

arXiv cs.CL / 4/20/2026


Key Points

  • The paper argues that current LLMs can significantly overstate their mathematical reasoning ability due to reward hacking, often producing correct answers via unsound solution processes.
  • Human-verified analysis leads to a taxonomy of failure modes, highlighting “Miracle Steps,” abrupt jumps to correct outputs without valid derivations.
  • Experiments suggest Miracle Steps are tied to answer-recall shortcuts, such as memorized answers from pretraining that bypass the reasoning chain.
  • To address this, the authors introduce a Rubric Reward Model (RRM) that scores the entire reasoning trajectory according to problem-specific rubrics, explicitly penalizing logical flaws.
  • When used in reinforcement learning, RRM-based training outperforms outcome-only supervision on four math benchmarks, notably raising AIME2024 Verified Pass@1024 from 26.7% to 62.6% and cutting Miracle Steps by 71%.

Abstract

In this paper, we observe that current models are susceptible to reward hacking, leading to a substantial overestimation of a model's reasoning ability. This is evidenced by a high incidence of false positives: solutions that reach the correct answer through an unsound process. Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps: abrupt jumps to a correct output without a valid preceding derivation. Probing experiments suggest that these Miracle Steps are linked to answer-recall shortcuts, including memorization from pretraining, where the model accesses the correct answer independently of its reasoning chain. To mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics. The RRM explicitly penalizes logical flaws and encourages rigorous deduction. When integrated into an RL pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work demonstrates that rewarding the solution process is crucial for building accurate and reliable models.
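To make the idea concrete, here is a minimal sketch of what a rubric-style process reward might look like. This is an illustrative assumption, not the paper's actual implementation: the names `RubricItem`, `rubric_reward`, and the penalty scheme are hypothetical, and in practice each rubric item would be judged by a grader model over the full reasoning trajectory.

```python
# Hypothetical sketch of a rubric-based process reward (not the paper's code).
# Each problem gets its own rubric; a grader judges whether the trajectory
# satisfies each item, and unsupported jumps ("Miracle Steps") are penalized.
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str   # e.g. "correctly sets up the recurrence"
    weight: float      # relative importance of this step
    satisfied: bool    # in practice, judged by a grader model

def rubric_reward(items: list[RubricItem],
                  has_miracle_step: bool = False,
                  miracle_penalty: float = 1.0) -> float:
    """Weighted fraction of rubric items satisfied, with an explicit
    penalty when the answer appears without a valid derivation."""
    total = sum(i.weight for i in items)
    earned = sum(i.weight for i in items if i.satisfied)
    score = earned / total if total else 0.0
    if has_miracle_step:
        score -= miracle_penalty
    return max(score, -1.0)  # clamp so one flaw cannot dominate arbitrarily
```

Under this assumed scheme, a trajectory that recalls the right answer but skips the derivation scores lower than one that reasons soundly, which is the incentive the RRM is designed to create.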