Diffusion Mental Averages

arXiv cs.CV / 4/1/2026


Key Points

  • The paper introduces Diffusion Mental Averages (DMA), aiming to generate a single “sharp and realistic” prototype of a concept directly from a diffusion model rather than averaging images outside the model.
  • It argues that prior data-centric averaging of diffusion samples yields blurry results, and proposes instead averaging in the model’s evolving semantic space by aligning multiple denoising trajectories so they converge from coarse to fine semantics.
  • DMA is framed as an optimization problem over multiple noise latents, producing a consistent visual summary and a way to probe how concepts are represented and biased in the diffusion process.
  • For multimodal concepts (e.g., many dog breeds), the method clusters samples in semantically rich embedding spaces like CLIP and then uses Textual Inversion or LoRA to connect CLIP clusters to diffusion space.
  • The authors claim it is the first approach to deliver consistent, realistic averages for both concrete and abstract concepts using this within-model averaging and trajectory-alignment strategy.
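The trajectory-alignment idea in the points above can be illustrated with a toy numpy sketch. The paper's actual loss and denoiser are not given here, so everything below is an assumption: a made-up one-step "denoiser" stands in for a pretrained diffusion model, and a linearly increasing pull toward the batch mean stands in for the coarse-to-fine convergence schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoise_step(x, t, T):
    """Toy stand-in for one denoising step: pull the latent a fraction
    of the way toward a fixed 'clean' point. A real DMA implementation
    would call a pretrained diffusion model here instead."""
    data = np.array([1.0, -2.0, 0.5])  # hypothetical clean signal
    return x + (data - x) / (T - t + 1)

def aligned_trajectories(latents, T=50):
    """Run K toy denoising trajectories while progressively pulling them
    toward their mean, so they agree on coarse structure early and
    collapse to a single prototype by the final step. This is a sketch
    of the trajectory-alignment idea, not the paper's actual objective."""
    x = np.stack(latents)                        # (K, D) noise latents
    for t in range(T):
        x = np.stack([toy_denoise_step(xi, t, T) for xi in x])
        w = (t + 1) / T                          # assumed coarse-to-fine schedule
        x = (1 - w) * x + w * x.mean(axis=0, keepdims=True)
    return x

K, D = 4, 3
latents = [rng.standard_normal(D) for _ in range(K)]
out = aligned_trajectories(latents)
spread = np.abs(out - out.mean(axis=0)).max()
print(spread)  # trajectories have collapsed onto one shared prototype
```

By the last step the alignment weight reaches 1, so all trajectories end at the same point: the "mental average" analogue in this toy setting.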

Abstract

Can a diffusion model produce its own "mental average" of a concept, one that is as sharp and realistic as a typical sample? We introduce Diffusion Mental Averages (DMA), a model-centric answer to this question. While prior methods aim to average image collections, they produce blurry results when applied to diffusion samples from the same prompt. These data-centric techniques operate outside the model, ignoring the generative process. In contrast, DMA averages within the diffusion model's semantic space, as discovered by recent studies. Since this space evolves across timesteps and lacks a direct decoder, we cast averaging as trajectory alignment: optimize multiple noise latents so their denoising trajectories progressively converge toward shared coarse-to-fine semantics, yielding a single sharp prototype. We extend our approach to multimodal concepts (e.g., dogs with many breeds) by clustering samples in semantically rich spaces such as CLIP and applying Textual Inversion or LoRA to bridge CLIP clusters into diffusion space. This is, to our knowledge, the first approach that delivers consistent, realistic averages, even for abstract concepts, serving as a concrete visual summary and a lens into model biases and concept representation.
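The multimodal extension can likewise be sketched in miniature. The idea is to cluster sample embeddings first and compute one average per cluster; each cluster would then be bridged back into diffusion space via Textual Inversion or LoRA, as the abstract describes. Since the real pipeline embeds generated images with CLIP, the synthetic two-mode "embeddings" and the minimal k-means below are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for CLIP embeddings of generated "dog" samples;
# a real pipeline would embed diffusion samples with an actual CLIP model.
modes = np.array([[3.0, 0.0], [-3.0, 0.0]])  # two hypothetical breeds
emb = np.concatenate([m + 0.3 * rng.standard_normal((20, 2)) for m in modes])

def kmeans(X, k, iters=25):
    """Minimal k-means: split embeddings into modes so each cluster can
    later get its own within-model average (bridged into diffusion space
    via Textual Inversion or LoRA in the paper's pipeline)."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None] - centers[None]) ** 2).sum(-1)   # (N, k) sq. dists
        labels = d.argmin(1)
        centers = np.stack([
            X[labels == j].mean(0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

labels, centers = kmeans(emb, k=2)
print(centers.round(1))  # one center near each mode
```

With well-separated modes, each cluster recovers one "breed", so averaging within a cluster avoids blending incompatible modes into a single blurry prototype.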