Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

arXiv cs.CV / 4/30/2026


Key Points

  • The paper explains why standard Classifier-Free Guidance (CFG) struggles in diffusion models: a globally uniform guidance scalar creates a “detail-artifact dilemma,” in which low guidance scales fail to inject intricate semantics, while high scales cause structural degradation, color over-saturation, and, in video, temporal inconsistencies.
  • Using differential geometry and Tweedie’s Formula, the authors argue that CFG effectively performs a tangential linear extrapolation. Because the natural data manifold is highly curved, this uniform linear step introduces a severe orthogonal deviation.
  • They derive theoretical upper bounds on safe guidance and introduce Spatial Adaptive Multi Guidance (SAMG) to adapt guidance spatially and point-wise during sampling.
  • SAMG is described as training-free and virtually zero-cost, using conservative minimum guidance near high-energy boundaries to protect micro-textures while applying aggressive maximum guidance in low-energy areas to improve semantic injection.
  • Experiments on multiple image and video diffusion architectures show SAMG improves semantic alignment, structural fidelity, and temporal smoothness while avoiding extra computational overhead.
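
The "tangential linear extrapolation" claim in the second point can be illustrated with textbook notation. The following is a sketch under the standard DDPM noise-prediction convention, not the paper's own derivation: the symbols $\epsilon_\varnothing$ (unconditional prediction), $\epsilon_c$ (conditional prediction), $w$ (guidance scale), and $\bar\alpha_t$ (cumulative noise schedule) are assumptions introduced here for illustration.

```latex
% Assumed forward process (standard DDPM convention):
%   x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon
\begin{align}
% Standard CFG with one global scalar w
\tilde\epsilon_w &= \epsilon_\varnothing + w\,(\epsilon_c - \epsilon_\varnothing) \\
% Tweedie-style posterior-mean estimate of the clean sample
\hat{x}_0(\epsilon) &= \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon}{\sqrt{\bar\alpha_t}} \\
% Substituting the first line into the second: the estimate is linear in w
\hat{x}_0(\tilde\epsilon_w) &= \hat{x}_0(\epsilon_\varnothing)
  + w\,\bigl(\hat{x}_0(\epsilon_c) - \hat{x}_0(\epsilon_\varnothing)\bigr)
\end{align}
```

Under these assumptions, raising $w$ moves the denoised estimate along a single straight line through the unconditional and conditional estimates. On a curved manifold, a straight-line step of this kind leaves the manifold as $w$ grows, which is the kind of deviation the summary describes.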

Abstract

Diffusion models have achieved remarkable success in synthesizing complex static and temporal visuals, a breakthrough largely driven by Classifier-Free Guidance (CFG). However, despite its pivotal role in aligning generated content with textual prompts, standard CFG relies on a globally uniform scalar. This homogeneous amplification traps models in a well-documented "detail-artifact dilemma": low guidance scales fail to inject intricate semantics, while high scales inevitably cause structural degradation, color over-saturation, and temporal inconsistencies in videos. In this paper, we expose the physical root of this flaw through the lens of differential geometry. By analyzing Tweedie's Formula, we reveal that CFG intrinsically performs a tangential linear extrapolation. Because the natural data manifold is highly curved, this uniform linear step introduces a severe orthogonal deviation. To keep the generation trajectory safely bounded, we formulate a theoretical upper bound for spatial and adaptive guidance. Based on these geometric insights, we propose Spatial Adaptive Multi Guidance (SAMG), a training-free and virtually zero-cost sampling algorithm. SAMG dynamically computes point-wise conditional guidance energy, applying a conservative minimum scale to high-energy boundary regions to preserve delicate micro-textures, while deploying an aggressive maximum scale in low-energy regions to maximize semantic injection. Extensive experiments across diverse image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) architectures demonstrate that SAMG effectively resolves the detail-artifact dilemma, achieving superior semantic alignment, structural integrity, and temporal smoothness without any computational overhead.
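
The point-wise rule described in the abstract (a minimum scale where the conditional guidance energy is high, a maximum scale where it is low) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' algorithm: the function name `spatial_adaptive_cfg`, the energy definition (squared conditional-minus-unconditional difference summed over channels), the min-max normalization, and the linear interpolation between `w_min` and `w_max` are all assumptions made here, since the abstract does not spell out the exact formula.

```python
import numpy as np

def spatial_adaptive_cfg(eps_uncond, eps_cond, w_min=3.0, w_max=9.0, tiny=1e-8):
    """Hypothetical spatially adaptive guidance for one sampling step.

    eps_uncond, eps_cond: noise predictions of shape (C, H, W).
    Returns the guided prediction and the per-pixel guidance map.
    """
    # Direction that standard CFG would amplify uniformly.
    delta = eps_cond - eps_uncond

    # Assumed "guidance energy": squared magnitude of delta at each pixel.
    energy = np.sum(delta ** 2, axis=0, keepdims=True)  # shape (1, H, W)

    # Assumed normalization: rescale the energy to [0, 1] within this step.
    lo, hi = energy.min(), energy.max()
    energy_norm = (energy - lo) / (hi - lo + tiny)

    # High energy -> scale near w_min; low energy -> scale near w_max.
    w_map = w_max - (w_max - w_min) * energy_norm

    # Same form as standard CFG, but with a per-pixel scale (broadcast over C).
    guided = eps_uncond + w_map * delta
    return guided, w_map


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps_u = rng.normal(size=(4, 8, 8))
    eps_c = eps_u + rng.normal(size=(4, 8, 8))
    guided, w_map = spatial_adaptive_cfg(eps_u, eps_c)
    print(guided.shape, w_map.shape)
```

Setting `w_min == w_max` in this sketch recovers ordinary CFG with a single global scale. The only extra work is an element-wise reduction over predictions the sampler already computes, which is consistent with the "training-free and virtually zero-cost" description, although the real method may compute its energy and scale map differently.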