Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

arXiv cs.CL / 4/20/2026


Key Points

  • The paper argues that current LLMs can significantly overstate their mathematical reasoning ability due to reward hacking, often producing correct answers via unsound solution processes.
  • Human-verified analysis leads to a taxonomy of failure modes, highlighting “Miracle Steps,” abrupt jumps to correct outputs without valid derivations.
  • Experiments suggest Miracle Steps are tied to answer-recall shortcuts, such as memorized answers from pretraining that bypass the reasoning chain.
  • To address this, the authors introduce a Rubric Reward Model (RRM) that scores the entire reasoning trajectory according to problem-specific rubrics, explicitly penalizing logical flaws.
  • When used in reinforcement learning, RRM-based training outperforms outcome-only supervision on four math benchmarks, notably raising AIME2024 Verified Pass@1024 from 26.7% to 62.6% and cutting Miracle Steps by 71%.

Abstract

In this paper, we observe that current models are susceptible to reward hacking, leading to a substantial overestimation of a model's reasoning ability. This is evidenced by a high incidence of false positives: solutions that reach the correct answer through an unsound process. Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps: abrupt jumps to a correct output without a valid preceding derivation. Probing experiments suggest that these Miracle Steps are linked to answer-recall shortcuts, including memorization from pretraining, where the model accesses the correct answer independently of its reasoning chain. To mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics. The RRM explicitly penalizes logical flaws and encourages rigorous deduction. When integrated into an RL pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work demonstrates that rewarding the solution process is crucial for building accurate and reliable models.
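To make the idea concrete, here is a minimal sketch of what a rubric-style process reward might look like. This is an illustrative assumption, not the paper's actual implementation: the names `RubricItem`, `rubric_reward`, and the penalty scheme are hypothetical, and in practice each rubric item would be judged by a grader model over the full reasoning trajectory.

```python
# Hypothetical sketch of a rubric-based process reward (not the paper's code).
# Each problem gets its own rubric; a grader judges whether the trajectory
# satisfies each item, and unsupported jumps ("Miracle Steps") are penalized.
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str   # e.g. "correctly sets up the recurrence"
    weight: float      # relative importance of this step
    satisfied: bool    # in practice, judged by a grader model

def rubric_reward(items: list[RubricItem],
                  has_miracle_step: bool = False,
                  miracle_penalty: float = 1.0) -> float:
    """Weighted fraction of rubric items satisfied, with an explicit
    penalty when the answer appears without a valid derivation."""
    total = sum(i.weight for i in items)
    earned = sum(i.weight for i in items if i.satisfied)
    score = earned / total if total else 0.0
    if has_miracle_step:
        score -= miracle_penalty
    return max(score, -1.0)  # clamp so one flaw cannot dominate arbitrarily
```

Under this assumed scheme, a trajectory that recalls the right answer but skips the derivation scores lower than one that reasons soundly, which is the incentive the RRM is designed to create.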