Golden RPG: Confidence-Adaptive Region-Aware Noise for Compositional Text-to-Image Generation

arXiv cs.CV / 4/29/2026


Key Points

  • The paper introduces Golden RPG, a region-aware noise prediction method for compositional text-to-image generation that improves prompt fidelity when multiple sub-prompts target spatially separated regions.
  • It extends a frozen NPNet with per-region FiLM adapters and a Region Cross-Attention layer to let different image locations attend to different sub-prompt tokens.
  • To avoid harming performance on easier prompts, the method uses a Confidence-Adaptive Blending head that adaptively controls how strongly regional conditioning overrides global noise.
  • Experiments on the RPG benchmark (20 prompts) and T2I-CompBench (1,200 images across four multi-region categories) show Golden RPG achieves the best cross-region coherence while matching top baselines on CLIP-based quality metrics, and a paired user study finds a ~67% preference over the strongest baseline.
  • The approach is lightweight, with about 2M trainable parameters and only ~0.6 seconds of additional inference time on top of SDXL.
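The per-region FiLM adapter in the second bullet can be pictured as a masked, channel-wise affine modulation of the predicted noise. The sketch below is illustrative only: the shapes, function name, and mask format are assumptions, not the paper's code.

```python
import numpy as np

def film_region_adapter(noise, masks, gammas, betas):
    """Apply a per-region FiLM modulation (gamma * x + beta) to a
    predicted noise map. Hypothetical shapes, not the paper's code.

    noise:  (C, H, W) globally predicted noise
    masks:  (R, H, W) binary region masks, one per sub-prompt
    gammas: (R, C) per-region channel scales, predicted from each sub-prompt
    betas:  (R, C) per-region channel shifts
    """
    out = noise.copy()
    for mask, g, b in zip(masks, gammas, betas):
        # Broadcast the channel-wise FiLM parameters over the spatial grid,
        # then overwrite only the pixels covered by this region's mask.
        modulated = g[:, None, None] * noise + b[:, None, None]
        out = np.where(mask[None] > 0, modulated, out)
    return out
```

In this reading, each sub-prompt only reshapes the noise inside its own region, so pixels outside every mask keep the global prediction unchanged.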

Abstract

Compositional text-to-image (T2I) generation requires a model to honour multiple sub-prompts that describe distinct image regions. Recent work shows that the starting noise of a diffusion model carries significant semantic information: "golden" noise predicted from text can substantially raise prompt fidelity. However, we observe that this noise prediction is fundamentally global: the same network must summarise a long, multi-region prompt with a single text embedding, which becomes the bottleneck whenever the prompt describes scenes with spatially separated entities. We introduce Golden RPG, a region-aware noise predictor that extends a frozen NPNet with two trainable additions: (i) a per-region FiLM adapter that reshapes the predicted noise according to each sub-prompt; and (ii) a Region Cross-Attention layer injected between two stages of the Swin backbone, allowing different spatial locations to attend to different sub-prompt tokens. To prevent the regional conditioning from degrading samples whose prompts are already easy, we further propose a Confidence-Adaptive Blending head that dynamically predicts, per sample, how strongly the regional signal should override the global signal. We evaluate on the original RPG benchmark (20 prompts, 100 samples) and on four multi-region categories of T2I-CompBench (1,200 images, six competing methods). Golden RPG achieves the highest Cross-Region-Coherence score in every category while matching the strongest baselines on absolute CLIP-Score and CLIP-IQA. A paired user study further shows a ~67% preference over the strongest baseline. The adapter contains ~2M trainable parameters and adds only 0.6 s of inference overhead on top of SDXL.
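The Confidence-Adaptive Blending described in the abstract can be sketched as a per-sample convex combination of the global and regional noise predictions, gated by a learned confidence logit. This is a minimal illustration under assumed names and shapes, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def confidence_adaptive_blend(global_noise, regional_noise, logit):
    """Blend global and region-aware noise with a per-sample weight.

    `logit` stands in for the output of the confidence head: alpha near 0
    keeps the global prediction (easy prompts), alpha near 1 lets the
    regional signal dominate. Illustrative sketch only.
    """
    alpha = sigmoid(logit)
    return (1.0 - alpha) * global_noise + alpha * regional_noise
```

Because the weight is predicted per sample, prompts that the global predictor already handles well can fall back to it almost entirely, which is how the method avoids regressing on easy prompts.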