Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning

arXiv cs.CV / 4/14/2026


Key Points

  • The paper identifies a common limitation in existing text-to-image diffusion models: a foreground bias that under-optimizes backgrounds, reducing global scene coherence and limiting compositional control.
  • It proposes a training-free sampling framework that explicitly models foreground–background interactions by restructuring diffusion inference rather than requiring model retraining.
  • Dynamic Spatial Guidance introduces a time-step-dependent gating mechanism to balance attention between foreground and background throughout the diffusion process.
  • Multi-Path Pruning explores multiple latent trajectories in parallel and dynamically filters candidates using attention statistics and external semantic alignment signals, retaining those that better satisfy object–background constraints.
  • The authors introduce a benchmark for object–background compositionality and report consistent improvements across multiple diffusion backbones.
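
The time-step-dependent gating described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the schedule shape, the `gamma` parameter, and the binary-mask formulation are all assumptions made here for clarity.

```python
import numpy as np

def spatial_gate(attn, fg_mask, t, T, gamma=2.0):
    """Soft, time-step-dependent gate balancing foreground/background attention.

    attn:    (H, W) attention map from a cross-attention layer
    fg_mask: (H, W) binary foreground mask (1 = foreground)
    t:       current diffusion step, counting down from T to 1
    gamma:   schedule sharpness (hypothetical hyperparameter)
    """
    # Early steps (large t) emphasize the background/layout;
    # late steps shift weight toward foreground refinement.
    w_bg = (t / T) ** gamma          # decays toward 0 as sampling proceeds
    w_fg = 1.0 - w_bg
    gated = attn * (w_fg * fg_mask + w_bg * (1.0 - fg_mask))
    return gated / (gated.sum() + 1e-8)   # renormalize to a distribution
```

At `t = T` the gate routes essentially all attention mass to background regions, and the balance reverses smoothly as `t` approaches 1, matching the intuition that diffusion models settle global layout early and local detail late.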

Abstract

Existing text-to-image diffusion models, while excelling at subject synthesis, exhibit a persistent foreground bias that treats the background as a passive, under-optimized byproduct. This imbalance compromises global scene coherence and constrains compositional control. To address this limitation, we propose a training-free framework that restructures diffusion sampling to explicitly account for foreground-background interactions. Our approach consists of two key components. First, Dynamic Spatial Guidance introduces a soft, time-step-dependent gating mechanism that modulates foreground and background attention during the diffusion process, enabling spatially balanced generation. Second, Multi-Path Pruning performs multi-path latent exploration and dynamically filters candidate trajectories using both internal attention statistics and external semantic alignment signals, retaining trajectories that better satisfy object-background constraints. We further develop a benchmark specifically designed to evaluate object-background compositionality. Extensive evaluations across multiple diffusion backbones demonstrate consistent improvements in background coherence and object-background compositional alignment.
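The trajectory-filtering step of Multi-Path Pruning can be sketched as a simple score-and-select loop. The scoring functions and the mixing weight `alpha` below are placeholders assumed for illustration; the paper's actual attention statistics and alignment model are not specified here.

```python
import numpy as np

def prune_paths(latents, attn_score_fn, align_score_fn, keep=2, alpha=0.5):
    """Score candidate latent trajectories and keep the best `keep`.

    latents:        list of candidate latents (one per sampling path)
    attn_score_fn:  internal attention statistic (higher = better fg/bg balance)
    align_score_fn: external semantic alignment signal, e.g. a CLIP-style
                    text-image score (hypothetical stand-in here)
    alpha:          weight mixing the two signals (assumed hyperparameter)
    """
    scores = [alpha * attn_score_fn(z) + (1 - alpha) * align_score_fn(z)
              for z in latents]
    order = np.argsort(scores)[::-1]           # best candidates first
    return [latents[i] for i in order[:keep]]  # surviving trajectories
```

In a full sampler this selection would run at chosen denoising steps: each surviving latent is branched into several candidates, scored, and pruned back to a fixed beam width, so compute stays bounded while the search favors trajectories that satisfy the object-background constraints.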