REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

arXiv cs.LG / 3/19/2026

📰 NewsModels & Research

共有:

Key Points

REAL introduces a regression-aware RL framework for LLM evaluation that optimizes regression-based rewards instead of binary signals.
It tackles the policy-dependency of regression objectives by using a generalized policy gradient estimator, decomposing optimization into exploration over Chain-of-Thought trajectories and regression-aware score refinement.
Experimental results across model scales (8B to 32B) show REAL consistently outperforming regression-aware SFT baselines and standard RL methods, with notable gains on Qwen3-32B (Pearson +8.40, Spearman +7.20).
The findings highlight improved generalization to out-of-domain benchmarks and demonstrate the value of integrating regression objectives into RL exploration for more accurate LLM evaluation.

Abstract

Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware approaches are often confined to Supervised Fine-Tuning (SFT), limiting their ability to explore optimal reasoning paths. To bridge this gap, we propose \textbf{REAL} (\underline{RE}gression-\underline{A}ware Reinforcement \underline{L}earning), a principled RL framework designed to optimize regression rewards, and also proven to be optimal for correlation metrics. A key technical challenge is that the regression objective is explicitly policy-dependent, thus invalidating standard policy gradient methods. To address this, we employ the generalized policy gradient estimator, which naturally decomposes optimization into two complementary components: (1) exploration over Chain-of-Thought (CoT) trajectory, and (2) regression-aware prediction refinement of the final score. Extensive experiments across model scales (8B to 32B) demonstrate that REAL consistently outperforms both regression-aware SFT baselines and standard RL methods, exhibiting significantly better generalization on out-of-domain benchmarks. On Qwen3-32B specifically, we achieve gains of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model. These findings highlight the critical value of integrating regression objectives into RL exploration for accurate LLM evaluation.

Self-Refining Agents in Spec-Driven Development

Dev.to

has anyone tried this? Flash-MoE: Running a 397B Parameter Model on a Laptop

Reddit r/LocalLLaMA

M2.7 open weights coming in ~2 weeks

Reddit r/LocalLLaMA

MiniMax M2.7 Will Be Open Weights

Reddit r/LocalLLaMA

Best open source coding models for claude code? LB?

Reddit r/LocalLLaMA

REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

Key Points

Abstract

Related Articles

Self-Refining Agents in Spec-Driven Development

has anyone tried this? Flash-MoE: Running a 397B Parameter Model on a Laptop

M2.7 open weights coming in ~2 weeks

MiniMax M2.7 Will Be Open Weights

Best open source coding models for claude code? LB?

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer