Reward Sharpness-Aware Fine-Tuning for Diffusion Models
arXiv cs.LG · 2026-03-24
Key Points
- The paper studies reward hacking in reward-centric diffusion reinforcement learning (RDRL) and argues it stems from non-robust reward-model gradients when the reward landscape is sharp with respect to the input image.
- It proposes Reward Sharpness-Aware Fine-Tuning (RSA-FT), which mitigates reward hacking by using gradients from a “robustified” reward signal obtained via parameter perturbations of the diffusion model and perturbations of generated samples, without retraining the reward model.
- Experiments show that each of the two perturbation strategies independently improves robustness to reward hacking, and that combining them amplifies the reliability gains further.
- RSA-FT is presented as simple and broadly compatible, offering a practical way to improve the alignment/controllability reliability of RDRL for diffusion models.
- Overall, the work reframes diffusion RDRL alignment reliability as a gradient-robustness problem and provides a mitigation approach aimed at perceptual-quality consistency rather than reward-score inflation.
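To make the sample-perturbation idea concrete, here is a minimal numpy sketch of one way to "robustify" a reward gradient: instead of following the raw gradient of a sharp reward at a generated sample, average the reward gradients over Gaussian-perturbed copies of the sample (a randomized-smoothing stand-in). The toy `reward` function, the perturbation scale `sigma`, and the sample count `n` are all illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def reward(x):
    # Toy "sharp" reward: a smooth bowl plus a high-frequency term
    # (hypothetical stand-in for a sharp reward-model landscape).
    return -np.sum(x**2) + 0.05 * np.sum(np.sin(50 * x))

def grad(f, x, eps=1e-5):
    # Central finite-difference gradient of f at x.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def robustified_grad(f, x, sigma=0.1, n=64, rng=None):
    # Average the reward gradient over Gaussian-perturbed copies of the
    # sample; the high-frequency (sharp) component averages out, leaving
    # a gradient dominated by the smooth part of the landscape.
    rng = np.random.default_rng(0) if rng is None else rng
    grads = [grad(f, x + sigma * rng.standard_normal(x.shape)) for _ in range(n)]
    return np.mean(grads, axis=0)

x = np.array([0.3, -0.2])
smooth_part_grad = -2 * x            # gradient of the bowl alone
raw = grad(reward, x)                # contaminated by the sharp term
robust = robustified_grad(reward, x)
```

In this toy setting, `robust` tracks the smooth component of the gradient far more closely than `raw`, illustrating how perturbation-averaged gradients can suppress the sharp, hackable component of a reward signal.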
