V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

arXiv cs.LG / April 28, 2026


Key Points

  • V-GRPO (Variational GRPO) tackles the challenge of aligning denoising generative models (e.g., diffusion models) with human preferences or verifiable rewards via online reinforcement learning, sidestepping their intractable likelihoods.
  • The authors show that an ELBO-based likelihood surrogate approach can be made both stable and efficient by reducing surrogate variance and carefully controlling gradient step sizes.
  • V-GRPO integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, together with a set of simple yet essential implementation techniques (a rough sketch of the core objective follows this list).
  • Experiments indicate V-GRPO delivers state-of-the-art results for text-to-image synthesis while improving runtime efficiency, running roughly 2× faster than MixGRPO and 3× faster than DiffusionNFT.
  • The method is designed to be easy to implement, align with pretraining objectives, and circumvent inefficiencies associated with MDP-based optimization over sampling trajectories.
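
A minimal sketch of the core recipe the key points describe, assuming PyTorch and an epsilon-prediction diffusion model. The function names, the linear noising schedule, and the uniform timestep weighting are placeholders of ours, and the paper's exact loss (including its clipping/step-size control) is not reproduced here:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO-style advantages: normalize rewards within the group of G samples
    generated for the same prompt (no learned value function needed)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def elbo_surrogate_logp(model, x0: torch.Tensor, cond, num_t: int = 4) -> torch.Tensor:
    """Monte-Carlo ELBO surrogate for log p(x0 | cond): the negative denoising
    error averaged over random timesteps. Assumes `model(x_t, t, cond)` predicts
    the injected noise and x0 has shape [G, C, H, W]. Averaging several timesteps
    per sample is one way to reduce the surrogate's variance."""
    total = torch.zeros(x0.shape[0], device=x0.device)
    for _ in range(num_t):
        t = torch.rand(x0.shape[0], device=x0.device)    # timesteps in (0, 1)
        noise = torch.randn_like(x0)
        a = (1.0 - t).view(-1, 1, 1, 1)                  # simple linear schedule;
        b = t.view(-1, 1, 1, 1)                          # the paper's may differ
        x_t = a * x0 + b * noise
        pred = model(x_t, t, cond)
        total = total - ((pred - noise) ** 2).mean(dim=(1, 2, 3))
    return total / num_t                                 # shape [G]; higher = more likely

def vgrpo_style_loss(model, images, cond, rewards) -> torch.Tensor:
    """REINFORCE-style loss with the ELBO surrogate standing in for the
    intractable log-likelihood; trust-region/clipping terms are omitted."""
    adv = group_relative_advantages(rewards).detach()
    return -(adv * elbo_surrogate_logp(model, images, cond)).mean()
```

Note what this structure buys: each update is a single denoising regression on sampled images, reweighted by group-relative advantages, so it reuses the pretraining objective instead of differentiating through an MDP over the full sampling trajectory.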

Abstract

Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can beat MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a 2× speedup over MixGRPO and a 3× speedup over DiffusionNFT.
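
To make the substitution the abstract describes concrete, one way to write it (notation assumed here, not copied from the paper) is: GRPO needs the log-probability of each sampled image, the diffusion ELBO lower-bounds it by a weighted denoising error, and that bound is reweighted by group-relative advantages.

```latex
% Intractable log-likelihood replaced by the diffusion ELBO:
\log p_\theta(x_0 \mid c)
  \;\ge\; \mathrm{ELBO}_\theta(x_0, c)
  = -\,\mathbb{E}_{t,\,\epsilon}\!\left[\, w(t)\,
      \bigl\lVert \epsilon_\theta(x_t, t, c) - \epsilon \bigr\rVert^2 \right]
    + \mathrm{const}.

% Group-relative advantages over G samples per prompt, plugged into the objective:
A_i = \frac{r_i - \operatorname{mean}(r_{1:G})}{\operatorname{std}(r_{1:G})},
\qquad
\mathcal{L}(\theta) \;=\; -\,\frac{1}{G} \sum_{i=1}^{G} A_i \,\mathrm{ELBO}_\theta\!\bigl(x_0^{(i)}, c\bigr).
```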