Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning
arXiv cs.CV · April 14, 2026
Key Points
- The paper identifies a common limitation in existing text-to-image diffusion models: a foreground bias that under-optimizes backgrounds, reducing global scene coherence and limiting compositional control.
- It proposes a training-free sampling framework that explicitly models foreground–background interactions by restructuring diffusion inference rather than requiring model retraining.
- Dynamic Spatial Guidance introduces a time-step-dependent gating mechanism to balance attention between foreground and background throughout the diffusion process.
- Multi-Path Pruning explores multiple candidate latent trajectories in parallel and dynamically filters them using attention statistics and external semantic-alignment signals, keeping only the paths that best satisfy the object–background constraints.
- The authors introduce a benchmark for object–background compositionality and report consistent improvements across multiple diffusion backbones.
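To make the time-step-dependent gating idea concrete, here is a minimal illustrative sketch. It is not the paper's method: the linear schedule, the binary foreground mask, and the function names (`spatial_gate`, `fg_mask`) are all assumptions introduced for illustration. The sketch only shows the general pattern of reweighting attention toward the background early in sampling (when global layout forms) and toward the foreground later (when detail is refined).

```python
import numpy as np

def spatial_gate(t, T, fg_mask, attn):
    """Illustrative time-dependent spatial gate (assumed linear schedule,
    not the paper's actual mechanism).

    t        -- current diffusion step, counted down from T to 0
    T        -- total number of diffusion steps
    fg_mask  -- (num_keys,) array, 1.0 for foreground tokens, 0.0 for background
    attn     -- (num_queries, num_keys) attention weights
    """
    alpha = t / T  # 1.0 at the noisiest step, 0.0 at the final step
    # Early steps (alpha near 1) emphasize background keys; late steps
    # (alpha near 0) emphasize foreground keys.
    weight = alpha * (1.0 - fg_mask) + (1.0 - alpha) * fg_mask
    gated = attn * weight
    # Renormalize so each query's attention still sums to 1.
    return gated / gated.sum(axis=-1, keepdims=True)
```

At the midpoint (`t = T / 2`) the gate weights foreground and background equally, so the attention map passes through unchanged after renormalization; the reweighting only bites at the extremes of the schedule.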
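The pruning step can likewise be sketched as a simple beam-style selection. Everything here is an assumed stand-in: `attn_score` and `align_score` are hypothetical callables for the attention statistic and the external semantic-alignment signal (e.g. a CLIP-style score), and the mixing weight `lam` is an invented parameter, not one reported by the paper.

```python
import numpy as np

def prune_paths(latents, attn_score, align_score, keep=2, lam=0.5):
    """Keep the top-`keep` candidate latents by a combined score.

    latents     -- list of candidate latent arrays (parallel sampling paths)
    attn_score  -- hypothetical callable scoring attention statistics
    align_score -- hypothetical callable scoring semantic alignment
    lam         -- assumed mixing weight between the two signals
    """
    scores = [lam * attn_score(z) + (1.0 - lam) * align_score(z)
              for z in latents]
    # Sort candidates by combined score, highest first, and prune the rest.
    order = np.argsort(scores)[::-1]
    return [latents[i] for i in order[:keep]]
```

In a real sampler this selection would run at chosen denoising steps, so low-scoring trajectories are dropped early and compute is concentrated on paths that satisfy both the attention and alignment constraints.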