Leveraging Verifier-Based Reinforcement Learning in Image Editing

arXiv cs.CV / 5/1/2026

📰 News · Models & Research

Key Points

  • The paper argues that applying reinforcement learning to image editing requires a more robust, general reward model; existing edit reward models provide only coarse overall scores.
  • It introduces Edit-R1 and its core Edit-RRM, which uses a chain-of-thought “reasoning verifier” to break an instruction into principles, check the edited image against each one, and produce interpretable, fine-grained rewards (a minimal sketch of this decompose-and-check loop follows this list).
  • To build the verifier-based reward model, the authors use supervised fine-tuning to bootstrap CoT reward trajectories, then train with Group Contrastive Preference Optimization (GCPO) using human pairwise preferences.
  • They further train image editing models with GRPO using the resulting non-differentiable reward model, and experiments show Edit-RRM outperforms strong VLMs (e.g., Seed-1.5-VL, Seed-1.6-VL) as an editing-specific reward model.
  • The work reports consistent scaling improvements from 3B to 7B parameters and demonstrates that Edit-R1 improves editing models such as FLUX.1-kontext.
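
To make the decompose-and-check idea concrete, here is a minimal Python sketch of a verifier-style reward. The `vlm.decompose` / `vlm.verify` calls, the `PrincipleCheck` structure, and the mean aggregation are illustrative assumptions, not the paper's exact recipe.

```python
from dataclasses import dataclass

@dataclass
class PrincipleCheck:
    principle: str   # e.g. "the cat's fur should now be white"
    rationale: str   # the verifier's chain-of-thought justification
    score: float     # per-principle score in [0, 1]

def verifier_reward(instruction: str, source_img, edited_img, vlm):
    """Decompose the instruction, verify each principle, aggregate into one reward."""
    # 1) Break the instruction into atomic, checkable principles.
    principles = vlm.decompose(instruction)                       # hypothetical API
    checks = []
    for p in principles:
        # 2) The verifier reasons over (source, edit) and scores this principle.
        rationale, score = vlm.verify(p, source_img, edited_img)  # hypothetical API
        checks.append(PrincipleCheck(p, rationale, score))
    # 3) Aggregate per-principle scores; a plain mean is assumed here.
    reward = sum(c.score for c in checks) / max(len(checks), 1)
    return reward, checks
```

The per-principle `checks` are what make the reward interpretable: a failing edit can be traced back to the specific requirement it violated rather than a single opaque score.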

Abstract

While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a “cold-start” to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
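
The abstract does not spell out GCPO's objective. As a generic point of reference, the standard Bradley-Terry pairwise loss below shows one common way human pairwise preferences can supervise a pointwise scorer such as the RRM; this is the textbook reward-model loss, not the paper's GCPO formulation.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(s_w - s_l): push the pointwise score of the preferred edit above the rejected one."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Usage: scores come from a (hypothetical) pointwise RRM on a human preference pair.
# loss = pairwise_preference_loss(rrm(instruction, src, img_preferred),
#                                 rrm(instruction, src, img_rejected))
```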
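
For the final stage, GRPO samples a group of candidate edits per prompt, scores each with the non-differentiable reward model, and normalizes rewards within the group to obtain per-sample advantages. The sketch below shows only that advantage computation; the sampler, the `rrm_score` call, and the clipping/KL terms of the full GRPO objective are placeholders or omitted.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G sampled edits of the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Usage with a black-box reward model such as Edit-RRM (no gradients flow through it):
# edits = [editor.sample(instruction, source) for _ in range(G)]            # hypothetical sampler
# rewards = torch.tensor([rrm_score(instruction, source, e) for e in edits])
# advantages = group_relative_advantages(rewards)  # weight each sample's policy-gradient update
```

Because the reward only enters through these scalar advantages, the reward model never needs to be differentiable, which is what lets a CoT verifier like Edit-RRM drive the editing-model update.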