Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

arXiv cs.AI / 3/30/2026


Key Points

  • The researchers propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into GRPO, addressing two weaknesses of existing reward designs.
  • PAPO composes the advantage from an outcome component Aout (derived from an ORM, responsible for final-answer correctness) and a process component Aproc (derived from a rubric-based PRM, responsible for reasoning quality), normalizing each *separately*; this design simultaneously counters the weakening of the outcome advantage signal and reward hacking of process rewards.
  • Aout is normalized over all responses, while Aproc is normalized only among correct responses, so reasoning quality can be differentiated without disturbing the learning anchor on final-answer correctness.
  • Across multiple model scales and six benchmarks, PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench, and it continues to improve even after ORM plateaus or declines.

Abstract

We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer richer supervision, but directly using PRM scores causes reward hacking, where models exploit verbosity to inflate scores while accuracy collapses. PAPO resolves both by composing the advantage from an outcome component Aout, derived from ORM and normalized over all responses, and a process component Aproc, derived from a rubric-based PRM and normalized exclusively among correct responses. This decoupled design ensures that Aout anchors training on correctness while Aproc differentiates reasoning quality without distorting the outcome signal. Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.
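The decoupled normalization described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `papo_advantages`, the binary ORM reward convention, the zero process advantage for incorrect responses, and the additive combination with a `weight` coefficient are all assumptions for the sake of the example; the paper specifies only that Aout is normalized over all responses and Aproc only among correct ones.

```python
import numpy as np

def papo_advantages(outcome_rewards, process_scores, weight=0.5, eps=1e-8):
    """Sketch of decoupled advantage normalization for one GRPO group.

    outcome_rewards: binary ORM rewards (1 = correct final answer, 0 = not).
    process_scores:  rubric-based PRM scores, one per response.
    weight:          hypothetical mixing coefficient for the process term.
    """
    r_out = np.asarray(outcome_rewards, dtype=float)
    r_proc = np.asarray(process_scores, dtype=float)

    # Outcome component: normalized over ALL responses in the group,
    # so correctness remains the training anchor.
    a_out = (r_out - r_out.mean()) / (r_out.std() + eps)

    # Process component: normalized ONLY among correct responses, so it
    # differentiates reasoning quality without distorting the outcome
    # signal; incorrect responses receive no process advantage here.
    a_proc = np.zeros_like(r_proc)
    correct = r_out == 1.0
    if correct.sum() > 1:
        c = r_proc[correct]
        a_proc[correct] = (c - c.mean()) / (c.std() + eps)

    # Composed advantage (the exact combination rule is an assumption).
    return a_out + weight * a_proc
```

With a group of four responses where two are correct, the correct response with the higher rubric score receives the largest advantage, any correct response outranks the incorrect ones, and the incorrect responses share the same (purely outcome-driven) advantage.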