AI Navigate

Alternating Reinforcement Learning with Contextual Rubric Rewards

arXiv cs.AI / 3/18/2026


Key Points

  • The paper introduces Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), which replaces scalar rewards with multi-dimensional, rubric-based evaluations to better capture objective correlations in RL tasks.
  • ARL-RR avoids fixed scalarization by optimizing one semantic rubric meta-class at a time and uses a lightweight, search-based adaptation procedure to dynamically select the next meta-class based on task performance.
  • The authors provide theoretical analysis showing that scalar reward aggregation induces a variance contraction effect, which helps explain the performance gains of the alternating rubric approach.
  • Empirical results on the HealthBench dataset with expert annotations show ARL-RR uniformly outperforms scalarized methods across model sizes (1.7B, 4B, 8B, 14B) in both model performance and training efficiency.
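The alternating loop described above can be sketched in a toy form. This is a minimal illustration, not the authors' implementation: the "policy" is just a dict of per-meta-class skill levels, the meta-class names are hypothetical examples, and the search-based adaptation is stood in for by a simple greedy rule that targets the currently weakest meta-class.

```python
# Toy stand-in for ARL-RR's alternating structure: the "policy" is a dict of
# per-rubric-meta-class skill levels, and one training phase nudges the skill
# of a single selected meta-class upward (no scalarized reward is ever formed).

def evaluate(policy):
    """Per-meta-class scores of the current policy (here, the skills themselves)."""
    return dict(policy)

def train_one_class(policy, meta_class, step=0.2):
    """One RL phase that optimizes only the selected rubric meta-class."""
    new_policy = dict(policy)
    new_policy[meta_class] = min(1.0, new_policy[meta_class] + step)
    return new_policy

def select_next_class(scores):
    """Stand-in for the lightweight search-based adaptation:
    greedily pick the currently weakest meta-class."""
    return min(scores, key=scores.get)

def alternating_rl(policy, rounds=10):
    """Alternate: evaluate per-class, pick one meta-class, train on it alone."""
    history = []
    for _ in range(rounds):
        target = select_next_class(evaluate(policy))
        policy = train_one_class(policy, target)
        history.append(target)
    return policy, history

# Hypothetical rubric meta-classes and initial skill levels.
policy0 = {"accuracy": 0.5, "completeness": 0.2,
           "communication": 0.4, "context_awareness": 0.3}
final, schedule = alternating_rl(policy0, rounds=8)
```

Because the scheduler always attacks the weakest dimension, the toy run balances all four scores upward, which is the intuition behind emphasizing critical objectives one at a time instead of averaging them into a single scalar.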

Abstract

Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing RLRR approaches are limited to linearly compressing vector rewards into a scalar reward with fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that selects the next meta-class dynamically based on task performance, enabling the policy to emphasize critical objectives and thereby improve model performance. Empirically, our experiments on the HealthBench dataset with expert annotations demonstrate that ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).