Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards

arXiv cs.LG / 3/25/2026

Key Points

The paper proposes a lightweight reinforcement-learning framework for autoregressive image generation that formulates token-based synthesis as a Markov Decision Process and optimizes with Group Relative Policy Optimization (GRPO).
It introduces a distribution-level Leave-One-Out FID (LOO-FID) reward computed using an exponential moving average of feature moments to explicitly promote diversity and mitigate mode collapse, addressing shortcomings of instance-only reward RL.
The method combines the distribution-level diversity reward with composite instance-level rewards using CLIP and HPSv2 to maintain semantic and perceptual fidelity.
Training is further stabilized for multi-objective optimization via an adaptive entropy regularization term.
Experiments on LlamaGen and VQGAN show improved quality and diversity metrics within only a few hundred tuning iterations. The tuned models remain competitive even without Classifier-Free Guidance, which avoids its 2x inference cost.
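
To make the GRPO step in the key points more concrete, here is a minimal sketch of how per-sample rewards could be combined and then turned into group-relative advantages. It assumes a plain weighted sum of the CLIP, HPSv2, and diversity terms and the common GRPO normalization (subtract the group mean, divide by the group standard deviation). The function names, the weights, and the example scores are hypothetical; the summary does not say how the paper balances or normalizes these terms.

```python
import numpy as np


def combine_rewards(clip_score, hps_score, diversity_reward, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of instance-level and distribution-level rewards.

    Assumption: the weights are placeholders; the source does not state
    how the three terms are balanced.
    """
    w_clip, w_hps, w_div = weights
    return (
        w_clip * np.asarray(clip_score, dtype=np.float64)
        + w_hps * np.asarray(hps_score, dtype=np.float64)
        + w_div * np.asarray(diversity_reward, dtype=np.float64)
    )


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same condition.

    Each sample is scored relative to its own group, so no learned value
    function (critic) is needed.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Hypothetical scores for a group of four images sampled for one condition.
clip = [0.31, 0.28, 0.35, 0.30]
hps = [0.26, 0.25, 0.27, 0.24]
diversity = [0.02, -0.01, 0.00, 0.03]

advantages = group_relative_advantages(combine_rewards(clip, hps, diversity))
```

In a full training loop, each advantage would weight the log-probabilities of the image tokens in the corresponding sampled sequence; that part is omitted here.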

Abstract

Autoregressive (AR) models are highly effective for image generation, yet their standard maximum-likelihood estimation training lacks direct optimization for sample quality and diversity. While reinforcement learning (RL) has been used to align diffusion models, these methods typically suffer from output diversity collapse. Similarly, concurrent RL methods for AR models rely strictly on instance-level rewards, often trading off distributional coverage for quality. To address these limitations, we propose a lightweight RL framework that casts token-based AR synthesis as a Markov Decision Process, optimized via Group Relative Policy Optimization (GRPO). Our core contribution is the introduction of a novel distribution-level Leave-One-Out FID (LOO-FID) reward; by leveraging an exponential moving average of feature moments, it explicitly encourages sample diversity and prevents mode collapse during policy updates. We integrate this with composite instance-level rewards (CLIP and HPSv2) for strict semantic and perceptual fidelity, and stabilize the multi-objective learning with an adaptive entropy regularization term. Extensive experiments on LlamaGen and VQGAN architectures demonstrate clear improvements across standard quality and diversity metrics within only a few hundred tuning iterations. The results also show that the model can be updated to produce competitive samples even without Classifier-Free Guidance, thereby avoiding its 2x inference cost.
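
The abstract does not spell out how the LOO-FID reward is computed, so the following is only one plausible reading, sketched under stated assumptions. It assumes that (a) the generated distribution is summarized by an exponential moving average of the first and second raw feature moments, (b) the current batch is blended into that average with a hypothetical decay `beta`, and (c) a sample's reward is the Fréchet distance to the reference statistics with that sample left out, minus the distance with the full batch. Under this sign convention, a sample that pulls the generated distribution toward the reference gets a positive reward. All function names and parameters are illustrative, not taken from the paper.

```python
import numpy as np


def frechet_distance(mu1, cov1, mu2, cov2):
    """Squared Frechet distance between two Gaussians (the quantity behind FID)."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov1 @ cov2, which are real and non-negative for
    # positive semi-definite inputs (up to numerical noise, hence the clip).
    eigvals = np.linalg.eigvals(cov1 @ cov2).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)


def blended_gaussian(feats, ema_m1, ema_m2, beta):
    """Blend batch raw moments into the EMA and return (mean, covariance).

    Assumption: the EMA tracks E[x] and E[x x^T]; beta is a hypothetical decay.
    """
    m1 = beta * ema_m1 + (1.0 - beta) * feats.mean(axis=0)
    m2 = beta * ema_m2 + (1.0 - beta) * (feats.T @ feats) / len(feats)
    return m1, m2 - np.outer(m1, m1)


def loo_fid_rewards(feats, ema_m1, ema_m2, ref_mu, ref_cov, beta=0.9):
    """Leave-one-out reward per sample: FID without sample i minus FID with all."""
    full = frechet_distance(*blended_gaussian(feats, ema_m1, ema_m2, beta), ref_mu, ref_cov)
    rewards = np.empty(len(feats))
    for i in range(len(feats)):
        rest = np.delete(feats, i, axis=0)  # batch with sample i removed
        loo = frechet_distance(*blended_gaussian(rest, ema_m1, ema_m2, beta), ref_mu, ref_cov)
        rewards[i] = loo - full
    return rewards
```

Here `feats` would be an array of shape `(group_size, feature_dim)` holding features of the generated images (for example from an Inception network), and `ref_mu`, `ref_cov` would be precomputed statistics of real images. Blending raw moments rather than covariances keeps the resulting covariance positive semi-definite, which is a design choice of this sketch rather than something the source confirms.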