ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework

arXiv cs.CL / 4/10/2026


Key Points

  • ReflectRM targets Generative Reward Models (GRMs), a new paradigm for reward models in RLHF, and proposes using self-reflection to evaluate the quality of the analytical process, something conventional outcome-centered training fails to capture.
  • By using reflection to identify the most reliable analysis and then deriving the final preference prediction from that analysis, it realizes a unified generative evaluation framework that jointly handles response preference and analysis preference.
  • Consistent performance gains are reported across four benchmarks, with an average accuracy improvement of +3.7 on Qwen3-4B.
  • Additional experiments show that response preference and analysis preference mutually reinforce each other; the method also substantially reduces positional bias, achieving a reported +10.2 improvement over leading GRMs.

Abstract

Reward Models (RMs) are critical components in the Reinforcement Learning from Human Feedback (RLHF) pipeline, directly determining the alignment quality of Large Language Models (LLMs). Recently, Generative Reward Models (GRMs) have emerged as a superior paradigm, offering higher interpretability and stronger generalization than traditional scalar RMs. However, existing methods for GRMs focus primarily on outcome-level supervision, neglecting analytical process quality, which constrains their potential. To address this, we propose ReflectRM, a novel GRM that leverages self-reflection to assess analytical quality and enhance preference modeling. ReflectRM is trained under a unified generative framework for joint modeling of response preference and analysis preference. During inference, we use its self-reflection capability to identify the most reliable analysis, from which the final preference prediction is derived. Experiments across four benchmarks show that ReflectRM consistently improves performance, achieving an average accuracy gain of +3.7 on Qwen3-4B. Further experiments confirm that response preference and analysis preference are mutually reinforcing. Notably, ReflectRM substantially mitigates positional bias, yielding a +10.2 improvement compared with leading GRMs and establishing itself as a more stable evaluator.
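
To make the inference-time procedure described above concrete, here is a minimal Python sketch of reflection-guided preference selection: sample several candidate analyses, score each with a self-reflection pass, keep the one the model judges most reliable, and read the final preference off that analysis. The function names, prompts, and stub model calls are assumptions for illustration, not the paper's actual implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Analysis:
    text: str                # generated comparative analysis of response A vs. B
    reflection_score: float  # self-assessed reliability from the reflection pass
    preferred: str           # "A" or "B", the preference implied by the analysis


def generate_analysis(prompt: str, response_a: str, response_b: str) -> Analysis:
    """Stub for one GRM forward pass producing an analysis and a preference."""
    # A real GRM would generate a natural-language comparison; here we fake it.
    return Analysis(
        text=f"Candidate analysis for prompt: {prompt[:40]}...",
        reflection_score=0.0,  # filled in by the reflection pass below
        preferred=random.choice(["A", "B"]),
    )


def reflect_on_analysis(analysis: Analysis) -> float:
    """Stub for the self-reflection pass: the GRM judges its own analysis quality."""
    # A real model would return a calibrated reliability score for the analysis.
    return random.random()


def reflectrm_inference(prompt: str, response_a: str, response_b: str, k: int = 4) -> str:
    """Sample k analyses, keep the most trusted one, and return its preference."""
    candidates = [generate_analysis(prompt, response_a, response_b) for _ in range(k)]
    for cand in candidates:
        cand.reflection_score = reflect_on_analysis(cand)
    best = max(candidates, key=lambda c: c.reflection_score)
    return best.preferred


if __name__ == "__main__":
    winner = reflectrm_inference("Explain RLHF briefly.", "response A text", "response B text")
    print(f"Preferred response: {winner}")
```

The key design point, as stated in the abstract, is that the preference is not averaged over all candidate analyses; it is derived from the single analysis the model's own reflection deems most reliable, which is how analysis preference feeds back into response preference.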