Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

arXiv cs.LG · April 21, 2026

📰 News · Models & Research

Key Points

  • The paper argues that end-to-end fine-tuning for LLM-based SWE agents often depends on binary terminal rewards (e.g., unit tests passing), which offer little guidance for improving intermediate steps in multi-turn problem solving.
  • It proposes a rubric-based Generative Reward Model (GRM) that uses human-designed rubrics to encourage or discourage specific behavioral patterns during interactions, generating denser learning signals than terminal-only feedback.
  • The method collects higher-quality training data through trajectory filtration, leveraging the rubric-driven feedback to select more informative trajectories.
  • In Reinforced Fine-Tuning (RFT) on SWE tasks, the rubric-based GRM outperforms terminal-score-only rejection sampling: it better suppresses undesirable behaviors, promotes beneficial ones, and improves final unit-test accuracy.
  • The work is presented as a new arXiv submission (arXiv:2604.16335v1), indicating a fresh research contribution rather than an incremental update to existing work.
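The scoring idea in the points above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the rubric names, weights, and the keyword-matching stand-in for the generative judge are all assumptions (a real GRM would prompt an LLM with the rubric text).

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    name: str
    weight: float  # positive encourages the pattern, negative discourages it

def rubric_score(step: str, rubrics: list[Rubric]) -> float:
    # Stand-in for the generative reward model: a simple keyword check.
    return sum(r.weight for r in rubrics if r.name in step)

def trajectory_score(steps: list[str], rubrics: list[Rubric],
                     tests_pass: bool, terminal_weight: float = 1.0) -> float:
    # Combine the sparse terminal reward with a dense, per-step rubric signal.
    dense = sum(rubric_score(s, rubrics) for s in steps) / max(len(steps), 1)
    return terminal_weight * float(tests_pass) + dense

# Hypothetical rubrics: reward reproducing the bug first, penalize blind retries.
rubrics = [Rubric("run_tests_before_patch", +0.5),
           Rubric("blind_retry", -0.5)]

good = ["run_tests_before_patch", "edit_file"]
bad = ["blind_retry", "blind_retry", "edit_file"]

assert trajectory_score(good, rubrics, tests_pass=True) > \
       trajectory_score(bad, rubrics, tests_pass=True)
```

Even when both trajectories pass the unit tests, the rubric term separates them, which is exactly the dense signal a binary terminal reward cannot provide.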

Abstract

Despite recent progress in Large Language Model (LLM) Agents for Software Engineering (SWE) tasks, end-to-end fine-tuning typically relies on verifiable terminal rewards such as whether all unit tests pass. While these binary signals reflect whether the final solution is correct, they provide little guidance for shaping intermediate behaviors during multi-step interactions, thereby limiting improvements in the overall quality of the resolution process. To address this, we introduce a rubric-based Generative Reward Model (GRM) that provides richer learning signals. The GRM is equipped with human-designed rubrics that indicate criteria for encouraging or discouraging specific behavioral patterns, and we leverage this feedback for high-quality training data collection via trajectory filtration. When used for Reinforced Fine-Tuning (RFT) on SWE Tasks, our approach outperforms terminal-score-only rejection sampling: it more effectively suppresses undesirable patterns while promoting beneficial ones, as confirmed by case analyses, and it ultimately improves final test accuracy.
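The trajectory-filtration step the abstract describes can be contrasted with terminal-only rejection sampling in a short sketch. The function names, dictionary fields, and threshold below are assumptions for illustration, not the paper's implementation.

```python
def terminal_only_filter(trajs):
    # Rejection sampling on the verifiable reward alone: keep anything that passes.
    return [t for t in trajs if t["tests_pass"]]

def rubric_filter(trajs, min_rubric=0.0):
    # Keep trajectories that pass the tests AND clear a rubric threshold,
    # dropping passing-but-sloppy trajectories from the RFT training data.
    return [t for t in trajs
            if t["tests_pass"] and t["rubric_score"] >= min_rubric]

trajs = [
    {"id": 1, "tests_pass": True,  "rubric_score": 0.8},   # clean solve
    {"id": 2, "tests_pass": True,  "rubric_score": -0.4},  # lucky but sloppy
    {"id": 3, "tests_pass": False, "rubric_score": 0.9},   # good process, failed
]

assert [t["id"] for t in terminal_only_filter(trajs)] == [1, 2]
assert [t["id"] for t in rubric_filter(trajs)] == [1]
```

The point of the comparison: terminal-only filtration admits trajectory 2, whose undesirable intermediate behavior would then be reinforced during fine-tuning, while the rubric-based filter excludes it.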