Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

arXiv cs.LG / 4/22/2026


Key Points

  • The paper addresses a key limitation of diffusion distillation methods like Distribution Matching Distillation (DMD): they can improve sampling speed but may degrade generation quality.
  • It argues that simply combining reinforcement learning (RL) with distillation can produce unreliable and conflicting reward signals because raw sample evaluation is noisy and misaligned with the distillation trajectory.
  • To fix this, the authors propose GDMD (Guiding Distribution Matching Distillation), which changes the reward mechanism to prioritize distillation gradients rather than raw pixel outputs.
  • By reinterpreting DMD gradients as implicit target tensors, GDMD lets existing reward models evaluate the quality of distillation updates directly and uses gradient-level guidance as adaptive weighting to prevent optimization divergence.
  • Experiments report a new state of the art in few-step generation: 4-step models outperform their multi-step teachers and beat prior DMDR results on GenEval and human-preference metrics, with strong scalability potential.
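The core idea in the points above can be sketched in a few lines. This is a hypothetical reconstruction, not the paper's implementation: `dmd_implicit_target` treats one DMD-style update step (the difference of fake/real score estimates) as an "implicit target tensor", and `gradient_level_reward` lets an off-the-shelf reward model score that target rather than the raw sample. The function names, the step size `eta`, and the reward-difference form are all illustrative assumptions.

```python
import numpy as np

def dmd_implicit_target(x, score_real, score_fake, eta=1.0):
    """Hypothetical sketch of a DMD-style gradient.

    The distribution-matching gradient is modeled as the difference of the
    fake- and real-score estimates; one descent step then yields an
    'implicit target tensor' (the paper's exact formulation may differ).
    """
    grad = score_fake(x) - score_real(x)   # distribution-matching direction
    target = x - eta * grad                # implicit target the update points toward
    return grad, target

def gradient_level_reward(reward_model, x, target):
    """Score the update *direction*, not the raw pixels (illustrative hook).

    Rewards the implicit target relative to the current sample, so the
    reward model evaluates the quality of the distillation update itself.
    """
    return reward_model(target) - reward_model(x)
```

With toy score functions, a step that moves a sample toward a higher-reward region receives a positive gradient-level reward, which is the property the summary attributes to GDMD.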

Abstract

Diffusion distillation, exemplified by Distribution Matching Distillation (DMD), has shown great promise in few-step generation but often sacrifices quality for sampling speed. While integrating Reinforcement Learning (RL) into distillation offers potential, a naive fusion of these two objectives relies on suboptimal raw sample evaluation. This sample-based scoring creates inherent conflicts with the distillation trajectory and produces unreliable rewards due to the noisy nature of early-stage generation. To overcome these limitations, we propose GDMD, a novel framework that redefines the reward mechanism by prioritizing distillation gradients over raw pixel outputs as the primary signal for optimization. By reinterpreting the DMD gradients as implicit target tensors, our framework enables existing reward models to directly evaluate the quality of distillation updates. This gradient-level guidance functions as an adaptive weighting that synchronizes the RL policy with the distillation objective, effectively neutralizing optimization divergence. Empirical results show that GDMD sets a new SOTA for few-step generation. Specifically, our 4-step models outperform the quality of their multi-step teacher and substantially exceed previous DMDR results in GenEval and human-preference metrics, exhibiting strong scalability potential.
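The abstract's "gradient-level guidance as adaptive weighting" can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the weighting rule (cosine alignment between the RL gradient and the distillation gradient, clipped at zero so conflicting directions are suppressed) and the function names are not taken from the paper.

```python
import numpy as np

def adaptive_weight(grad_distill, grad_rl, eps=1e-8):
    """Illustrative adaptive weight: cosine alignment, clipped to [0, 1].

    RL gradients that conflict with the distillation direction get weight 0,
    so the RL policy stays synchronized with the distillation objective
    (a hypothetical instance of gradient-level guidance; the paper's
    actual rule may differ).
    """
    cos = grad_distill.ravel() @ grad_rl.ravel() / (
        np.linalg.norm(grad_distill) * np.linalg.norm(grad_rl) + eps)
    return max(float(cos), 0.0)

def combined_update(grad_distill, grad_rl, rl_scale=1.0):
    """Distillation gradient plus adaptively down-weighted RL gradient."""
    w = adaptive_weight(grad_distill, grad_rl)
    return grad_distill + rl_scale * w * grad_rl
```

Under this toy rule, an RL gradient pointing opposite the distillation update contributes nothing, while an aligned one is kept at full strength, which captures the "neutralizing optimization divergence" behavior the abstract describes.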