BlendFusion -- Scalable Synthetic Data Generation for Diffusion Model Training

arXiv cs.CV, April 13, 2026


Key Points

  • Diffusion model training increasingly uses synthetic image-caption data, but purely model-generated images can cause visual inconsistencies and a feedback loop that leads to “Model Autophagy Disorder” (MAD).
  • The paper introduces BlendFusion, a scalable synthetic data generation framework that renders images from 3D scenes via path tracing, aiming to produce more consistent training data for diffusion models.
  • BlendFusion combines object-centric camera placement, robust filtering, and automatic captioning to generate high-quality image-caption pairs.
  • Using this pipeline, the authors curate FineBLEND, an image-caption dataset built from diverse 3D scenes, and evaluate it against several established image-caption datasets.
  • The authors release an open-source, highly configurable framework so others can generate their own datasets from 3D scenes, and show that object-centric camera placement outperforms object-agnostic sampling.
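The paper does not spell out its camera-placement algorithm here, but the idea behind "object-centric" placement can be sketched as follows: instead of dropping cameras anywhere in the scene, sample a position on a sphere around a target object's bounding volume and aim the camera at it, so the object is guaranteed to be in frame. The function name, the `margin` parameter, and the sphere-sampling choice below are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def object_centric_camera(center, radius, margin=1.5):
    """Hypothetical sketch of object-centric camera placement.

    Samples a camera position on a sphere of radius `margin * radius`
    around the object's bounding-sphere `center`, and returns a look-at
    vector pointing back at the object. `margin` (assumed here) controls
    how much of the frame the object fills.
    """
    # Uniform random direction on the unit sphere.
    theta = random.uniform(0.0, 2.0 * math.pi)
    z = random.uniform(-1.0, 1.0)
    r = math.sqrt(1.0 - z * z)
    direction = (r * math.cos(theta), r * math.sin(theta), z)

    # Place the camera at a distance that keeps the object in view.
    distance = margin * radius
    position = tuple(c + d * distance for c, d in zip(center, direction))
    # Aim vector: from the camera position back toward the object center.
    look_at = tuple(c - p for c, p in zip(center, position))
    return position, look_at
```

An object-agnostic baseline, by contrast, would sample `position` uniformly inside the scene bounds with a random orientation, with no guarantee that any object of interest is visible.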

Abstract

With the rapid adoption of diffusion models, synthetic data generation has emerged as a promising approach for addressing the growing demand for large-scale image datasets. However, images generated purely by diffusion models often exhibit visual inconsistencies, and training models on such data can create an autophagous feedback loop that leads to model collapse, commonly referred to as Model Autophagy Disorder (MAD). To address these challenges, we propose BlendFusion, a scalable framework for synthetic data generation from 3D scenes using path tracing. Our pipeline incorporates an object-centric camera placement strategy, robust filtering mechanisms, and automatic captioning to produce high-quality image-caption pairs. Using this pipeline, we curate FineBLEND, an image-caption dataset constructed from a diverse set of 3D scenes. We empirically analyze the quality of FineBLEND and compare it to several widely used image-caption datasets. We also demonstrate the effectiveness of our object-centric camera placement strategy relative to object-agnostic sampling approaches. Our open-source framework is designed for high configurability, enabling the community to create their own datasets from 3D scenes.
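The "autophagous feedback loop" the abstract warns about can be illustrated with a toy experiment that is not from the paper: treat a "model" as a Gaussian fitted to data, and at each generation refit it only on samples drawn from the previous generation's model. With finite sample sizes, the fitted variance tends to shrink over generations, a minimal analogue of the diversity loss in Model Autophagy Disorder.

```python
import random
import statistics

def autophagous_loop(generations=200, sample_size=20, seed=0):
    """Toy illustration (not the paper's experiment) of an autophagous loop.

    Generation 0 is the "real" data distribution, a standard Gaussian.
    Each subsequent generation fits a new Gaussian to a finite sample
    drawn from the previous generation's model, never touching real data.
    Returns the fitted variance at each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # the real data distribution
    variances = [sigma ** 2]
    for _ in range(generations):
        # Train only on synthetic data from the previous model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        variances.append(sigma ** 2)
    return variances
```

Grounding each generation in real, consistent data, which is what rendering from 3D scenes via path tracing provides, breaks this loop, since the training images are no longer samples from the model being trained.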