Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings

arXiv cs.CV / 4/29/2026

📰 News · Models & Research

Key Points

  • The paper addresses a key challenge in evaluating layout-guided text-to-image diffusion models: measuring both semantic alignment to prompts and spatial fidelity to layouts, which is hard due to costly fine-grained annotations.
  • It introduces two benchmarks: a closed-set C-Bench with controlled prompt/layout complexity and an open-set O-Bench using real-world prompts and layouts to test performance “in the wild.”
  • The authors propose a unified evaluation protocol that combines semantic and spatial accuracy into a single score to enable consistent and comparable model ranking.
  • They run a large-scale evaluation of six state-of-the-art layout-guided diffusion models, generating and evaluating 319,086 images, and publish an overall ranking plus detailed breakdowns for text vs. layout alignment.
  • Additional analyses examine how model strengths and weaknesses vary across scenarios and prompt complexities, and the accompanying code is released on GitHub.
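The summary does not specify how the protocol folds the two accuracies into one score, so the following is only an illustrative sketch, not the authors' actual formula: a harmonic-mean combiner (in the spirit of an F1 score) that rewards models only when both semantic and spatial accuracy are high. The function name `unified_score` and the choice of harmonic mean are our assumptions.

```python
def unified_score(semantic_acc: float, spatial_acc: float) -> float:
    """Hypothetical combiner: harmonic mean of semantic accuracy
    (alignment with the prompt) and spatial accuracy (fidelity to
    the layout). NOT the paper's actual protocol.

    Both inputs are assumed to lie in [0, 1]; the result is 0 when
    either capability is 0, so a model cannot compensate for a total
    failure on one axis with strength on the other.
    """
    if semantic_acc + spatial_acc == 0:
        return 0.0
    return 2 * semantic_acc * spatial_acc / (semantic_acc + spatial_acc)


# Example: a model strong on text alignment but weak on layout
# adherence is pulled toward its weaker score.
score = unified_score(0.9, 0.3)
```

A harmonic mean is one natural choice here because an arithmetic mean would let a layout-ignoring model rank highly on prompt alignment alone, undermining the point of a joint ranking.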

Abstract

Evaluating layout-guided text-to-image generative models requires assessing both semantic alignment with textual prompts and spatial fidelity to prescribed layouts. Assessing layout alignment requires collecting fine-grained annotations, which is costly and labor-intensive. Consequently, current benchmarks rarely provide comprehensive layout evaluation and often remain limited in scale or coverage, making model comparison, ranking, and interpretation difficult. In this work, we introduce a closed-set benchmark (C-Bench) designed to isolate key generative capabilities while providing varying levels of complexity in both prompt structure and layout. To complement this controlled setting, we propose an open-set benchmark (O-Bench) that evaluates models using real-world prompts and layouts, offering a measure of semantic and spatial alignment in the wild. We further develop a unified evaluation protocol that combines semantic and spatial accuracy into a single score, ensuring consistent model ranking. Using our benchmarks, we conduct a large-scale evaluation of six state-of-the-art layout-guided diffusion models, totaling 319,086 generated and evaluated images. We establish a model ranking based on their overall performance and provide detailed breakdowns for text and layout alignment to enhance interpretability. Fine-grained analyses across scenarios and prompt complexities highlight the strengths and limitations of current models. Code is available at https://github.com/lparolari/cobench.