Improving Controllable Generation: Faster Training and Better Performance via $x_0$-Supervision

arXiv cs.CV / 4/8/2026


Key Points

  • The paper addresses a key limitation of text-to-image diffusion/flow models: they struggle to precisely control image layout using only natural language, motivating controllable generation methods with extra conditioning.
  • Prior approaches typically train the augmented controllable network using the same loss as the original model, but the authors show this can cause long training times before convergence.
  • They propose revisiting the controllable diffusion training objective using $x_0$-supervision (direct supervision on the clean target image) or an equivalent re-weighting of the diffusion loss to speed convergence.
  • Experiments across multiple control settings report up to 2× faster convergence, measured via the authors' proposed mean Area Under the Convergence Curve (mAUCC) metric, alongside improvements in visual quality and conditioning accuracy.
  • The authors provide an open-source implementation at the linked GitHub repository.
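
The claimed equivalence between $x_0$-supervision and a re-weighted diffusion loss can be checked numerically. For the standard parameterization $x_t = \alpha_t x_0 + \sigma_t \epsilon$, a noise-prediction network implies a clean-image estimate $\hat{x}_0 = (x_t - \sigma_t \hat{\epsilon})/\alpha_t$, so $\|\hat{x}_0 - x_0\|^2 = (\sigma_t/\alpha_t)^2 \|\hat{\epsilon} - \epsilon\|^2$. The following toy sketch (not the paper's code; all variable names are illustrative) verifies this identity on random scalars:

```python
import random

# Toy check of the x0-supervision <-> reweighted eps-loss equivalence.
# With x_t = a*x0 + s*eps, the implied clean estimate is
# x0_hat = (x_t - s*eps_hat)/a, so (x0_hat - x0) = -(s/a)*(eps_hat - eps).
random.seed(0)

def losses(a, s, x0, eps, eps_hat):
    x_t = a * x0 + s * eps                          # noisy sample
    x0_hat = (x_t - s * eps_hat) / a                # implied clean-image estimate
    l_x0 = (x0_hat - x0) ** 2                       # direct x0-supervision
    l_eps_w = (s / a) ** 2 * (eps_hat - eps) ** 2   # reweighted eps-prediction loss
    return l_x0, l_eps_w

for _ in range(5):
    a, s = random.uniform(0.1, 1.0), random.uniform(0.1, 1.0)
    x0, eps, eps_hat = (random.gauss(0, 1) for _ in range(3))
    l1, l2 = losses(a, s, x0, eps, eps_hat)
    assert abs(l1 - l2) < 1e-9
```

This is why the paper can describe the same objective either as direct supervision on the clean target or as a per-timestep re-weighting of the usual diffusion loss: the two differ only by the factor $(\sigma_t/\alpha_t)^2$, which up-weights low-noise timesteps.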

Abstract

Text-to-Image (T2I) diffusion/flow models have recently achieved remarkable progress in visual fidelity and text alignment. However, they remain limited when users need to precisely control image layouts, something that natural language alone cannot reliably express. Controllable generation methods augment the initial T2I model with additional conditions that more easily describe the scene. Prior works straightforwardly train the augmented network with the same loss as the initial network. Although natural at first glance, this can in some cases lead to very long training times before convergence. In this work, we revisit the training objective of controllable diffusion models through a detailed analysis of their denoising dynamics. We show that direct supervision on the clean target image, dubbed $x_0$-supervision, or an equivalent re-weighting of the diffusion loss, yields faster convergence. Experiments on multiple control settings demonstrate that our formulation accelerates convergence by up to 2× according to our novel metric (mean Area Under the Convergence Curve, mAUCC), while also improving both visual quality and conditioning accuracy. Our code is available at https://github.com/CEA-LIST/x0-supervision
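
The abstract does not spell out how mAUCC is computed; one plausible reading, sketched below purely as an illustration (the paper's exact definition may differ), is the trapezoidal area under a quality-vs-training-steps curve, normalized by the step range and averaged over runs or control settings. A faster-converging run then accumulates more area under the same final score:

```python
# Hypothetical sketch of an area-under-the-convergence-curve metric;
# the paper's precise mAUCC definition may differ from this illustration.

def aucc(steps, scores):
    """Trapezoidal area under a score-vs-step convergence curve,
    normalized by the step range so runs of different lengths compare."""
    area = sum((scores[i] + scores[i + 1]) / 2 * (steps[i + 1] - steps[i])
               for i in range(len(steps) - 1))
    return area / (steps[-1] - steps[0])

def mean_aucc(runs):
    """Mean AUCC over several (steps, scores) runs or control settings."""
    return sum(aucc(s, v) for s, v in runs) / len(runs)

# Two runs reaching the same final score; the faster one scores higher:
fast = ([0, 1000, 2000], [0.0, 0.8, 0.9])
slow = ([0, 1000, 2000], [0.0, 0.4, 0.9])
assert aucc(*fast) > aucc(*slow)
```

The appeal of such a metric is that it rewards both convergence speed and final quality in a single number, rather than reporting only the endpoint.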