From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

arXiv cs.LG / 4/6/2026


Key Points

  • The paper analyzes how Chain-of-Thought (CoT) exploration and reinforcement learning (RL) optimization interact in autoregressive text-to-image generation, showing exploration expands token space while RL narrows toward high-reward regions.
  • It finds final reward is strongly negatively correlated with both the mean and variance of image-token entropy, implying that reducing uncertainty/instability is critical for better outcomes.
  • The authors show that the entropy of the textual CoT meaningfully determines downstream image quality, where lower-entropy CoTs produce better generations.
  • Based on these insights, they introduce Entropy-Guided Group Relative Policy Optimization (EG-GRPO), a fine-tuning strategy that excludes low-entropy tokens from reward-driven updates and grants an entropy bonus to high-entropy tokens.
  • Experiments on standard text-to-image benchmarks report state-of-the-art performance for EG-GRPO, indicating improved stability and generation quality through entropy-guided optimization.
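The correlation finding above rests on two simple statistics: the mean and variance of per-token entropy over the generated image-token sequence. A minimal sketch of how such statistics could be computed from a model's logits (the function name and NumPy-based setup are illustrative, not from the paper):

```python
import numpy as np

def token_entropy_stats(logits):
    """Mean and variance of per-token Shannon entropy.

    logits: (seq_len, vocab_size) array of unnormalized next-token scores.
    Returns the two statistics the paper reports as strongly negatively
    correlated with final reward.
    """
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # H_t = -sum_v p_t(v) * log p_t(v), treating 0 * log 0 as 0.
    h = -np.sum(np.where(p > 0.0, p * np.log(p), 0.0), axis=-1)
    return float(h.mean()), float(h.var())
```

For a uniform distribution over V tokens each entropy is log V and the variance is zero, which gives a quick sanity check of the implementation.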

Abstract

Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT's exploration and RL's optimization remains unclear. We present a systematic entropy-based analysis that yields three key insights: (1) CoT expands the generative exploration space, while RL contracts it toward high-reward regions; (2) final reward is strongly negatively correlated with both the mean and variance of image-token entropy, highlighting the need to reduce uncertainty and instability; and (3) the entropy of the textual CoT directly governs downstream image quality, with lower-entropy CoTs leading to better generations. Motivated by these findings, we propose Entropy-Guided Group Relative Policy Optimization (EG-GRPO), a fine-tuning strategy that reallocates optimization budget by uncertainty: low-entropy tokens are excluded from reward-driven updates to preserve stability, while high-entropy tokens receive an entropy bonus that encourages structured exploration without collapse. Experiments on standard T2I benchmarks demonstrate that EG-GRPO achieves state-of-the-art performance.
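The reallocation described in the abstract can be sketched as a per-token masked loss: low-entropy tokens are dropped from the policy-gradient term, while high-entropy tokens additionally receive an entropy bonus. The sketch below is a rough illustration under assumed names and hyperparameters (`tau`, `beta`, and the function itself are hypothetical; the paper's actual thresholds, schedules, and GRPO machinery are not reproduced here):

```python
import numpy as np

def eg_grpo_token_loss(logp, entropy, advantage, tau=1.0, beta=0.01):
    """Entropy-guided GRPO-style token loss (illustrative sketch only).

    logp:      (T,) log-probabilities of the sampled tokens
    entropy:   (T,) per-token policy entropies
    advantage: scalar group-relative advantage for this sample
    tau:       assumed entropy threshold separating low/high uncertainty
    beta:      assumed weight of the entropy bonus
    """
    high = entropy >= tau  # boolean mask selecting high-entropy tokens
    # Low-entropy tokens are excluded from the reward-driven term,
    # preserving the stable, already-confident parts of the sequence.
    pg = -(advantage * logp * high).sum()
    # High-entropy tokens also receive an entropy bonus, encouraging
    # structured exploration without collapsing the distribution.
    bonus = -beta * (entropy * high).sum()
    return pg + bonus
```

When every token's entropy falls below the threshold, the loss is zero, so reward-driven updates simply leave those tokens untouched, which matches the stability-preserving intent described in the abstract.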
