Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

arXiv cs.CL / 4/6/2026


Key Points

  • The paper introduces “Rubrics to Tokens (RTT),” a rubric-based reinforcement learning framework aimed at improving LLM alignment for open-domain instruction following tasks.
  • It addresses reward sparsity and ambiguity by moving from coarse response-level rewards to fine-grained token-level credit assignment using a Token-Level Relevance Discriminator.
  • RTT-GRPO is proposed to unify response-level and token-level advantages in a single optimization framework for the policy model.
  • To handle a shift from one-dimensional outcome rewards to a three-dimensional token-level rubric reward space, the authors propose “Intra-sample Token Group Normalization.”
  • Reported experiments and benchmarks indicate RTT achieves higher instruction-level and rubric-level accuracy than existing baselines across multiple models.
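The "Intra-sample Token Group Normalization" mentioned above can be illustrated with a small sketch. The paper's exact formulation is not given here; the helper below simply assumes that token-level rubric rewards are standardized within a single sample, per rubric dimension (the function name, array layout, and `eps` are hypothetical):

```python
import numpy as np

def intra_sample_token_group_norm(token_rewards, eps=1e-8):
    """Standardize token-level rubric rewards within one sample.

    token_rewards: array of shape (num_tokens, num_rubrics) holding the
    per-token scores of a single response across rubric dimensions
    (a hypothetical layout, not the paper's actual data structure).
    Returns an array of the same shape, roughly zero-mean and unit-std
    per rubric column.
    """
    mean = token_rewards.mean(axis=0, keepdims=True)
    std = token_rewards.std(axis=0, keepdims=True)
    return (token_rewards - mean) / (std + eps)
```

Normalizing within a sample (rather than across the sampled group, as vanilla GRPO does for response rewards) would keep the three rubric dimensions on a comparable scale before they are mixed into token-level advantages.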

Abstract

Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction following tasks. However, existing methods rely predominantly on response-level rewards, which introduce severe reward sparsity and reward ambiguity. To address these issues, we propose Rubrics to Tokens (RTT), a novel rubric-based RL framework that bridges coarse response-level scores and fine-grained token-level credit assignment. RTT introduces a Token-Level Relevance Discriminator to predict which tokens in a response are responsible for satisfying a specific constraint, and optimizes the policy model via RTT-GRPO, which integrates response-level and token-level advantages within a unified framework. Furthermore, because token-level rubric-based RL shifts from a one-dimensional, outcome-level reward to a three-dimensional reward space, we propose a novel group normalization method, called Intra-sample Token Group Normalization, to accommodate this change. Extensive experiments and benchmarks demonstrate that RTT consistently outperforms other baselines in both instruction- and rubric-level accuracy across different models.
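As a rough illustration of how response-level and token-level advantages might be combined under RTT-GRPO, the sketch below applies GRPO-style group normalization to scalar response rewards and adds a relevance-gated token-level term. The mixing rule, the weight `alpha`, and all function and argument names are assumptions for illustration, not the paper's actual formulation:

```python
import numpy as np

def rtt_grpo_advantages(response_rewards, token_relevance, token_rewards,
                        alpha=0.5, eps=1e-8):
    """Combine response- and token-level advantages for one GRPO group.

    response_rewards: (G,) scalar rubric score per sampled response.
    token_relevance:  list of G arrays, each (T_i,), discriminator scores
                      in [0, 1] marking which tokens address the constraint
                      (hypothetical output of the relevance discriminator).
    token_rewards:    list of G arrays, each (T_i,), per-token rubric rewards.
    Returns a list of G arrays of per-token advantages.
    """
    r = np.asarray(response_rewards, dtype=float)
    # GRPO: normalize scalar rewards across the sampled group of responses
    resp_adv = (r - r.mean()) / (r.std() + eps)
    combined = []
    for a_resp, rel, tok in zip(resp_adv, token_relevance, token_rewards):
        tok = np.asarray(tok, dtype=float)
        # normalize token rewards within this sample (assumed scheme)
        tok_adv = (tok - tok.mean()) / (tok.std() + eps)
        # broadcast the response advantage to every token and add the
        # relevance-gated token-level term
        combined.append(a_resp + alpha * np.asarray(rel, dtype=float) * tok_adv)
    return combined
```

Gating the token term by the discriminator's relevance score means tokens judged irrelevant to a constraint fall back to the plain response-level advantage, which is one plausible way to realize the "unified framework" the abstract describes.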
