Efficient Diffusion Distillation via Embedding Loss

arXiv cs.CV · April 27, 2026


Key Points

  • The paper proposes “Embedding Loss” (EL), a new supplementary loss for diffusion model distillation that aims to improve generation quality and speed up training.
  • Unlike regression-based or GAN-based supplementary losses, EL aligns feature distributions in an embedding space using Maximum Mean Discrepancy (MMD), enabling robust matching without requiring massive pre-generated datasets.
  • EL uses feature embeddings from a diverse set of randomly initialized networks to preserve sample fidelity and diversity during distillation, particularly for one-step (i.e., fewest-step) generators (a minimal sketch of the loss follows this list).
  • Experiments report state-of-the-art FID scores on CIFAR-10 (1.475 unconditional, 1.380 conditional) and consistent gains across additional datasets (ImageNet, AFHQ-v2, FFHQ) and distillation frameworks (DMD, DI, and CM).
  • The method can reduce training iterations by up to 80%, making diffusion-based generative models more feasible for resource-constrained researchers and deployment scenarios.
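
The MMD-in-embedding-space idea above is concrete enough to sketch. Below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the paper does not specify the embedding architectures, kernel, or weighting used here. A handful of frozen, randomly initialized networks map generated and real images into feature spaces, and a Gaussian-kernel MMD compares the two batches in each space.

```python
import torch
import torch.nn as nn

def gaussian_mmd2(x, y, sigma=1.0):
    """Squared MMD with a Gaussian kernel (simple biased estimator:
    diagonal terms are included). x: (n, d), y: (m, d) embeddings."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)   # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Hypothetical ensemble of frozen, randomly initialized embedding
# networks (architectures chosen for illustration only).
embedders = [nn.Sequential(
                 nn.Conv2d(3, 64, 3, stride=2, padding=1),
                 nn.LeakyReLU(),
                 nn.Conv2d(64, 128, 3, stride=2, padding=1),
                 nn.Flatten()).eval()
             for _ in range(4)]
for e in embedders:
    for p in e.parameters():
        p.requires_grad_(False)   # frozen: gradients flow only to the images

def embedding_loss(fake_images, real_images):
    """Average the MMD across the random embedding spaces."""
    total = 0.0
    for e in embedders:
        total = total + gaussian_mmd2(e(fake_images), e(real_images))
    return total / len(embedders)
```

Because the embedders stay frozen, only the generator receives gradients, which avoids the adversarial min-max dynamics that make GAN-based supplementary losses unstable.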

Abstract

Recent advances in distilling expensive diffusion models into efficient few-step generators show significant promise. However, these methods typically demand substantial computational resources and extended training periods, limiting accessibility for resource-constrained researchers. Existing supplementary loss functions also have notable limitations: regression loss requires pre-generating large datasets before training and caps the student model at the teacher's performance, while GAN-based losses suffer from training instability and require careful tuning. In this paper, we propose Embedding Loss (EL), a novel supplementary loss function that complements existing diffusion distillation methods to enhance generation quality and accelerate training with smaller batch sizes. Leveraging feature embeddings from a diverse set of randomly initialized networks, EL effectively aligns the feature distributions of the distilled few-step generator and the original data. By computing Maximum Mean Discrepancy (MMD) in the embedded feature space, EL ensures robust distribution matching, preserving sample fidelity and diversity during distillation. Within distribution matching distillation frameworks, EL demonstrates strong empirical performance for one-step generators. On the CIFAR-10 dataset, our approach achieves state-of-the-art FID values of 1.475 for unconditional generation and 1.380 for conditional generation. Beyond CIFAR-10, we further validate EL on the ImageNet, AFHQ-v2, and FFHQ datasets under the DMD, DI, and CM distillation frameworks, demonstrating consistent improvements over existing one-step distillation methods. Our method also reduces training iterations by up to 80%, offering a more practical and scalable solution for deploying diffusion-based generative models in resource-constrained environments.
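
The abstract frames EL as a supplementary term added to an existing distillation objective. The sketch below shows how such a term could plug into a one-step generator's training step; it is an illustration under assumed names, not the paper's code. `base_distill_loss` is a hypothetical stand-in for a DMD-, DI-, or CM-style objective, `el_fn` is an EL callable such as the `embedding_loss` helper sketched earlier, and `lambda_el` is an assumed tunable weight.

```python
def train_step(generator, noise, real_images, base_distill_loss,
               el_fn, optimizer, lambda_el=1.0):
    """One hypothetical training step: base distillation loss plus a
    weighted Embedding Loss term on the same generated batch."""
    fake_images = generator(noise)               # one-step generation
    loss = base_distill_loss(fake_images)        # e.g., a DMD-style term
    loss = loss + lambda_el * el_fn(fake_images, real_images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since EL only adds a term to the generator's loss, this structure would apply unchanged across the DMD, DI, and CM frameworks mentioned above; the appropriate value of `lambda_el` is an assumption and would need tuning per framework.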