Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

arXiv cs.LG · April 21, 2026

📰 News · Models & Research

Key Points

  • The paper argues that end-to-end fine-tuning for LLM-based SWE agents often depends on binary terminal rewards (e.g., unit tests passing), which offer little guidance for improving intermediate steps in multi-turn problem solving.
  • It proposes a rubric-based Generative Reward Model (GRM) that uses human-designed rubrics to encourage or discourage specific behavioral patterns during interactions, generating denser learning signals than terminal-only feedback.
  • The method collects higher-quality training data through trajectory filtration, leveraging the rubric-driven feedback to select more informative trajectories.
  • In Reinforced Fine-Tuning (RFT) on SWE tasks, the rubric-based GRM outperforms terminal-score-only rejection sampling: it better suppresses undesirable behaviors, promotes beneficial ones, and improves final unit-test accuracy.
  • The work is presented as a new arXiv submission (arXiv:2604.16335v1), indicating a fresh research contribution rather than an incremental update to existing work.
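The scoring idea in the points above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the rubric names, weights, and the keyword-matching stand-in for the generative judge are all assumptions (a real GRM would prompt an LLM with the rubric text).

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    name: str
    weight: float  # positive encourages the pattern, negative discourages it

def rubric_score(step: str, rubrics: list[Rubric]) -> float:
    # Stand-in for the generative reward model: a simple keyword check.
    return sum(r.weight for r in rubrics if r.name in step)

def trajectory_score(steps: list[str], rubrics: list[Rubric],
                     tests_pass: bool, terminal_weight: float = 1.0) -> float:
    # Combine the sparse terminal reward with a dense, per-step rubric signal.
    dense = sum(rubric_score(s, rubrics) for s in steps) / max(len(steps), 1)
    return terminal_weight * float(tests_pass) + dense

# Hypothetical rubrics: reward reproducing the bug first, penalize blind retries.
rubrics = [Rubric("run_tests_before_patch", +0.5),
           Rubric("blind_retry", -0.5)]

good = ["run_tests_before_patch", "edit_file"]
bad = ["blind_retry", "blind_retry", "edit_file"]

assert trajectory_score(good, rubrics, tests_pass=True) > \
       trajectory_score(bad, rubrics, tests_pass=True)
```

Even when both trajectories pass the unit tests, the rubric term separates them, which is exactly the dense signal a binary terminal reward cannot provide.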

Abstract

Despite recent progress in Large Language Model (LLM) Agents for Software Engineering (SWE) tasks, end-to-end fine-tuning typically relies on verifiable terminal rewards such as whether all unit tests pass. While these binary signals reflect whether the final solution is correct, they provide little guidance for shaping intermediate behaviors during multi-step interactions, thereby limiting improvements in the overall quality of the resolution process. To address this, we introduce a rubric-based Generative Reward Model (GRM) that provides richer learning signals. The GRM is equipped with human-designed rubrics that indicate criteria for encouraging or discouraging specific behavioral patterns, and we leverage this feedback for high-quality training data collection via trajectory filtration. When used for Reinforced Fine-Tuning (RFT) on SWE Tasks, our approach outperforms terminal-score-only rejection sampling: it more effectively suppresses undesirable patterns while promoting beneficial ones, as confirmed by case analyses, and it ultimately improves final test accuracy.
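The trajectory-filtration step the abstract describes can be contrasted with terminal-only rejection sampling in a short sketch. The function names, dictionary fields, and threshold below are assumptions for illustration, not the paper's implementation.

```python
def terminal_only_filter(trajs):
    # Rejection sampling on the verifiable reward alone: keep anything that passes.
    return [t for t in trajs if t["tests_pass"]]

def rubric_filter(trajs, min_rubric=0.0):
    # Keep trajectories that pass the tests AND clear a rubric threshold,
    # dropping passing-but-sloppy trajectories from the RFT training data.
    return [t for t in trajs
            if t["tests_pass"] and t["rubric_score"] >= min_rubric]

trajs = [
    {"id": 1, "tests_pass": True,  "rubric_score": 0.8},   # clean solve
    {"id": 2, "tests_pass": True,  "rubric_score": -0.4},  # lucky but sloppy
    {"id": 3, "tests_pass": False, "rubric_score": 0.9},   # good process, failed
]

assert [t["id"] for t in terminal_only_filter(trajs)] == [1, 2]
assert [t["id"] for t in rubric_filter(trajs)] == [1]
```

The point of the comparison: terminal-only filtration admits trajectory 2, whose undesirable intermediate behavior would then be reinforced during fine-tuning, while the rubric-based filter excludes it.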