Golden RPG: Confidence-Adaptive Region-Aware Noise for Compositional Text-to-Image Generation

arXiv cs.CV / 4/29/2026


Key Points

  • The paper introduces Golden RPG, a region-aware noise prediction method for compositional text-to-image generation that improves prompt fidelity when multiple sub-prompts target spatially separated regions.
  • It extends a frozen NPNet with per-region FiLM adapters and a Region Cross-Attention layer to let different image locations attend to different sub-prompt tokens.
  • To avoid harming performance on easier prompts, the method uses a Confidence-Adaptive Blending head that adaptively controls how strongly regional conditioning overrides global noise.
  • Experiments on the RPG benchmark (20 prompts) and T2I-CompBench (1,200 images across four multi-region categories) show Golden RPG achieves the best cross-region coherence while matching top baselines on CLIP-based quality metrics, and a paired user study finds a ~67% preference over the strongest baseline.
  • The approach is lightweight, with about 2M trainable parameters and only ~0.6 seconds of additional inference time on top of SDXL.
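The per-region FiLM adapter in the second bullet can be pictured as a masked, channel-wise affine modulation of the predicted noise. The sketch below is illustrative only: the shapes, function name, and mask format are assumptions, not the paper's code.

```python
import numpy as np

def film_region_adapter(noise, masks, gammas, betas):
    """Apply a per-region FiLM modulation (gamma * x + beta) to a
    predicted noise map. Hypothetical shapes, not the paper's code.

    noise:  (C, H, W) globally predicted noise
    masks:  (R, H, W) binary region masks, one per sub-prompt
    gammas: (R, C) per-region channel scales, predicted from each sub-prompt
    betas:  (R, C) per-region channel shifts
    """
    out = noise.copy()
    for mask, g, b in zip(masks, gammas, betas):
        # Broadcast the channel-wise FiLM parameters over the spatial grid,
        # then overwrite only the pixels covered by this region's mask.
        modulated = g[:, None, None] * noise + b[:, None, None]
        out = np.where(mask[None] > 0, modulated, out)
    return out
```

In this reading, each sub-prompt only reshapes the noise inside its own region, so pixels outside every mask keep the global prediction unchanged.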

Abstract

Compositional text-to-image (T2I) generation requires a model to honour multiple sub-prompts that describe distinct image regions. Recent work shows that the starting noise of a diffusion model carries significant semantic information: "golden" noise predicted from text can substantially raise prompt fidelity. However, we observe that this noise prediction is fundamentally global: the same network must summarise a long, multi-region prompt with a single text embedding, which becomes the bottleneck whenever the prompt describes scenes with spatially separated entities. We introduce Golden RPG, a region-aware noise predictor that extends a frozen NPNet with two trainable additions: (i) a per-region FiLM adapter that reshapes the predicted noise according to each sub-prompt; and (ii) a Region Cross-Attention layer injected between two stages of the Swin backbone, allowing different spatial locations to attend to different sub-prompt tokens. To prevent the regional conditioning from degrading samples whose prompts are already easy, we further propose a Confidence-Adaptive Blending head that dynamically predicts, per sample, how strongly the regional signal should override the global signal. We evaluate on the original RPG benchmark (20 prompts, 100 samples) and on four multi-region categories of T2I-CompBench (1,200 images, six competing methods). Golden RPG achieves the highest Cross-Region-Coherence score in every category while matching the strongest baselines on absolute CLIP-Score and CLIP-IQA. A paired user study further shows a ~67% preference over the strongest baseline. The adapter contains ~2M trainable parameters and adds only 0.6 s of inference overhead on top of SDXL.
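The Confidence-Adaptive Blending described in the abstract can be sketched as a per-sample convex combination of the global and regional noise predictions, gated by a learned confidence logit. This is a minimal illustration under assumed names and shapes, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def confidence_adaptive_blend(global_noise, regional_noise, logit):
    """Blend global and region-aware noise with a per-sample weight.

    `logit` stands in for the output of the confidence head: alpha near 0
    keeps the global prediction (easy prompts), alpha near 1 lets the
    regional signal dominate. Illustrative sketch only.
    """
    alpha = sigmoid(logit)
    return (1.0 - alpha) * global_noise + alpha * regional_noise
```

Because the weight is predicted per sample, prompts that the global predictor already handles well can fall back to it almost entirely, which is how the method avoids regressing on easy prompts.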