Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards
arXiv cs.LG / 3/25/2026
Key Points
- The paper proposes a lightweight reinforcement-learning framework for autoregressive image generation: token-by-token synthesis is formulated as a Markov Decision Process and optimized with Group Relative Policy Optimization (GRPO); a minimal sketch of the update follows this list.
- It introduces a distribution-level Leave-One-Out FID (LOO-FID) reward, computed against an exponential moving average of feature moments, to explicitly promote diversity and mitigate mode collapse, addressing a shortcoming of RL with instance-level rewards alone (see the second sketch below).
- The method combines this distribution-level diversity reward with composite instance-level rewards based on CLIP and HPSv2 scores to preserve semantic and perceptual fidelity.
- An adaptive entropy regularization term further stabilizes the multi-objective optimization (see the final sketch below).
- Experiments on LlamaGen and VQGAN show improved quality and diversity metrics after only a few hundred tuning iterations, with results that remain competitive even without Classifier-Free Guidance, cutting inference cost.
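
For readers who want the mechanics, here is a minimal sketch of the GRPO step from the first key point: sample a group of token sequences per prompt, normalize their rewards within the group, and weight each sequence's log-likelihood by that relative advantage. The `TinyPolicy`, vocabulary size, and random rewards are placeholders, not the paper's LlamaGen model or reward stack.

```python
# Minimal sketch of GRPO-style group-relative advantages for a token-based
# image generator. The tiny policy, vocabulary size, and reward values are
# placeholders, not the paper's LlamaGen/VQGAN models.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, GROUP = 256, 16, 8  # token vocab, tokens per image, group size

class TinyPolicy(nn.Module):
    """Stand-in autoregressive token policy (the real model would be LlamaGen)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 64)
        self.head = nn.Linear(64, VOCAB)

    def forward(self, tokens):                 # tokens: (B, T)
        return self.head(self.embed(tokens))   # logits: (B, T, VOCAB)

def sample_group(policy, group=GROUP):
    """Sample a group of token sequences and their per-token log-probs."""
    tokens = torch.zeros(group, 1, dtype=torch.long)  # BOS token 0
    logps = []
    for _ in range(SEQ_LEN):
        logits = policy(tokens)[:, -1]                # next-token logits
        dist = torch.distributions.Categorical(logits=logits)
        nxt = dist.sample()
        logps.append(dist.log_prob(nxt))
        tokens = torch.cat([tokens, nxt[:, None]], dim=1)
    return tokens[:, 1:], torch.stack(logps, dim=1)   # (G, T), (G, T)

def grpo_loss(logps, rewards, eps=1e-6):
    """Group-relative advantage: normalize rewards within the group, then
    weight the sequence log-likelihood (REINFORCE-style) by the advantage."""
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)    # (G,)
    return -(adv[:, None].detach() * logps).sum(dim=1).mean()

policy = TinyPolicy()
tokens, logps = sample_group(policy)
rewards = torch.randn(GROUP)   # placeholder for the instance+distribution reward
loss = grpo_loss(logps, rewards)
loss.backward()
print(f"GRPO loss: {loss.item():.3f}")
```

The advantage is detached so only the log-probabilities carry gradient; in practice GRPO also clips the probability ratio against a reference policy, which is omitted here for brevity.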
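The LOO-FID reward can be pictured as follows: maintain an EMA of reference feature moments, then score each sample by how much the batch FID changes when that sample is removed. This is one plausible reading of the summary; the feature encoder, decay rate, and sign convention (`loo - full`) are assumptions, with real features coming from an Inception-style encoder.

```python
# Sketch of a leave-one-out FID diversity reward against EMA reference
# moments. Feature dimension, decay, and the reward sign convention are
# assumptions; random vectors stand in for encoder features.
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sig1, mu2, sig2):
    """Frechet distance between two Gaussians (the standard FID formula)."""
    covmean = sqrtm(sig1 @ sig2)
    if np.iscomplexobj(covmean):   # numerical artifact of sqrtm
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sig1 + sig2 - 2 * covmean))

class EMAMoments:
    """Exponential moving average of the feature mean and covariance."""
    def __init__(self, dim, decay=0.99):
        self.mu = np.zeros(dim)
        self.sigma = np.eye(dim)
        self.decay = decay

    def update(self, feats):       # feats: (N, dim)
        mu_b = feats.mean(axis=0)
        sig_b = np.cov(feats, rowvar=False)
        self.mu = self.decay * self.mu + (1 - self.decay) * mu_b
        self.sigma = self.decay * self.sigma + (1 - self.decay) * sig_b

def loo_fid_rewards(feats, ref):
    """Per-sample reward: how much sample i improves the batch FID.
    reward_i = FID(batch without i) - FID(full batch), so samples whose
    removal *hurts* FID (raises it) receive positive reward."""
    full = fid(feats.mean(0), np.cov(feats, rowvar=False), ref.mu, ref.sigma)
    rewards = np.empty(len(feats))
    for i in range(len(feats)):
        rest = np.delete(feats, i, axis=0)
        loo = fid(rest.mean(0), np.cov(rest, rowvar=False), ref.mu, ref.sigma)
        rewards[i] = loo - full
    return rewards

rng = np.random.default_rng(0)
ref = EMAMoments(dim=8)
ref.update(rng.normal(size=(16, 8)))   # placeholder encoder features
print(loo_fid_rewards(rng.normal(size=(16, 8)), ref))
```

Removing one of several near-duplicates barely changes the batch statistics, so duplicates earn little reward, while samples that extend coverage of the reference distribution earn more; this is how the signal discourages mode collapse.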
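Finally, a sketch of how the composite reward and the adaptive entropy term might fit together. The summary does not state the paper's adaptation rule, so this uses a SAC-style multiplicative update toward a target entropy as one plausible choice; the weights `W_CLIP`, `W_HPS`, `W_DIV` and all hyperparameters are likewise illustrative.

```python
# Sketch of combining instance-level (CLIP, HPSv2) and distribution-level
# (LOO-FID) rewards with an adaptive entropy coefficient. Weights, target
# entropy, and the update rule are assumptions, not the paper's settings.
import torch

W_CLIP, W_HPS, W_DIV = 1.0, 1.0, 0.5   # assumed reward weights

def composite_reward(clip_score, hps_score, div_reward):
    """Weighted sum of semantic, perceptual, and diversity signals."""
    return W_CLIP * clip_score + W_HPS * hps_score + W_DIV * div_reward

class AdaptiveEntropyCoef:
    """Nudge the entropy bonus toward a target policy entropy, in the
    spirit of SAC-style temperature tuning (one plausible realization)."""
    def __init__(self, target_entropy, beta=0.01, lr=0.05):
        self.target = target_entropy
        self.beta = beta
        self.lr = lr

    def update(self, entropy):
        # Raise beta when the policy falls below target entropy, lower it above.
        self.beta *= float(torch.exp(self.lr * (self.target - entropy)))
        return self.beta

coef = AdaptiveEntropyCoef(target_entropy=3.0)
policy_entropy = torch.tensor(2.4)      # measured from sampled token distributions
beta = coef.update(policy_entropy)
reward = composite_reward(0.31, 0.27, 0.12)
print(f"beta={beta:.4f}, reward={reward:.3f}")
# The total objective would be: advantage-weighted log-prob + beta * entropy.
```

Tying the bonus to measured entropy keeps exploration pressure roughly constant as the mix of instance- and distribution-level rewards shifts during tuning, which is one way such a term could stabilize multi-objective optimization.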