When Do Diffusion Models Learn to Generate Multiple Objects?

arXiv cs.AI / 5/4/2026


Key Points

  • The paper investigates why text-to-image diffusion models are unreliable at generating multiple objects, focusing specifically on whether the limitation comes from training data or from model learning itself.
  • It separates “concept generalization” (individual concepts observed during training with potentially imbalanced frequencies) from “compositional generalization” (certain concept combinations are intentionally withheld) to isolate different failure modes.
  • Using a controlled synthetic dataset generation framework called “mosaic,” the authors train diffusion models and find that overall scene complexity is the dominant factor behind multi-object generation failures rather than concept imbalance.
  • The study also shows that learning to perform counting is uniquely difficult when training data is scarce, and that compositional generalization degrades sharply when more concept combinations are held out.
  • The results suggest fundamental limitations of current diffusion models for multi-object compositional generation and motivate better inductive biases and more deliberate dataset design to improve robustness.

Abstract

Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.
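The two regimes in the abstract differ in how training data is split. A minimal sketch of the compositional-generalization split, where specific concept combinations are withheld while every individual concept still appears in training (the concept vocabulary here is an illustrative stand-in, not the actual mosaic attributes):

```python
from itertools import product
import random

# Illustrative concept axes; the real mosaic framework uses spatial
# relations, attribution, and counting, not this toy vocabulary.
colors = ["red", "green", "blue"]
shapes = ["circle", "square", "triangle"]

all_combos = list(product(colors, shapes))  # 9 (color, shape) pairs

def compositional_split(combos, n_held_out, seed=0):
    """Hold out whole concept combinations (compositional generalization).

    Each individual concept still occurs in training via its other
    combinations, provided n_held_out is small relative to the grid.
    """
    rng = random.Random(seed)
    held_out = set(rng.sample(combos, n_held_out))
    train = [c for c in combos if c not in held_out]
    return train, sorted(held_out)

train, held_out = compositional_split(all_combos, n_held_out=2)
```

Increasing `n_held_out` here corresponds to the abstract's finding: as more combinations are withheld, the model must generalize to more unseen pairings, and the paper reports that this is where generation quality collapses.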
