$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

arXiv cs.CV · April 28, 2026

📰 News · Models & Research

Key Points

  • The paper argues that standard Classifier-Free Guidance (CFG) misses the data manifold’s intrinsic curvature, motivating multi-step zigzag-style trajectories for better semantic alignment.
  • It proposes “Implicit Z-Sampling,” which algebraically collapses the zigzag’s intermediate states, eliminating off-manifold truncation errors and the cost of explicit zigzag evaluations.
  • Building on this, the authors introduce “$Z^2$-Sampling,” which uses temporal coherence from the Probability Flow ODE plus a cached Temporal Semantic Surrogate to return sampling efficiency to the standard 2-NFE baseline.
  • The method is theoretically analyzed with Backward Error Analysis, showing that the discrete collapse induces a directional-derivative curvature penalty that preserves semantic exploration.
  • Experiments indicate $Z^2$-Sampling breaks the performance–efficiency Pareto frontier and generalizes across diffusion architectures (U-Nets, DiTs) and modalities (image/video), composing cleanly with other alignment methods such as AYS and Diffusion-DPO.
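The cost overhead that motivates the paper can be illustrated with a toy sketch. This is a minimal illustration, not the paper's algorithm: `eps` stands in for one diffusion-network evaluation, and the update rules are placeholder arithmetic. The point is purely the NFE accounting — an explicit zigzag step (denoise, invert back to higher noise, denoise again) spends three network evaluations where a plain step spends one, matching the "triple the NFE cost" claim.

```python
def eps(x, t, counter):
    """Stand-in for one diffusion-network call (1 NFE); toy noise prediction."""
    counter[0] += 1
    return 0.1 * x

def plain_step(x, t, counter):
    # Standard guided denoising step: a single network evaluation.
    return x - eps(x, t, counter)

def zigzag_step(x, t, counter):
    # Explicit Z-Sampling-style zigzag: forward, backward, forward.
    x_denoised = x - eps(x, t, counter)                    # 1) denoise
    x_inverted = x_denoised + eps(x_denoised, t, counter)  # 2) invert (renoise)
    return x_inverted - eps(x_inverted, t, counter)        # 3) denoise again
```

Counting the calls makes the tripling concrete: `plain_step` registers 1 NFE per step, `zigzag_step` registers 3.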

Abstract

Diffusion models have achieved unprecedented success in text-aligned generation, largely driven by Classifier-Free Guidance (CFG). However, standard CFG operates strictly on instantaneous gradients, omitting the intrinsic curvature of the data manifold. Recent methods like Zigzag-sampling (Z-Sampling) explicitly traverse multi-step forward-backward trajectories to probe this curvature, significantly improving semantic alignment. Yet, these explicit traversals triple the Neural Function Evaluation (NFE) cost and introduce unconstrained truncation errors from off-manifold evaluations, causing cumulative drift from the true marginal distribution. In this paper, we theoretically demonstrate that the explicit zigzag sequence is topologically reducible. We propose Implicit Z-Sampling, rigorously proving that intermediate states can be algebraically annihilated via operator dualities, physically eliminating off-manifold approximation errors. To push sampling efficiency to its theoretical lower bound, we introduce Z^2-Sampling (Zero-cost Zigzag Sampling). Exploiting the Probability Flow ODE's temporal coherence, Z^2-Sampling couples implicit algebraic collapse with a dynamically cached Temporal Semantic Surrogate. This restores the standard 2-NFE baseline without sacrificing semantic exploration. We formally prove via Backward Error Analysis that this discrete collapse inherently synthesizes a directional derivative curvature penalty. Finally, extensive evaluations demonstrate that Z^2-Sampling structurally shatters the performance-efficiency Pareto frontier. We validate its universal applicability across diverse architectures (U-Nets, DiTs) and modalities (image/video), establishing seamless orthogonality with advanced alignment frameworks (AYS, Diffusion-DPO).
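The caching idea behind $Z^2$-Sampling can be sketched as follows. Everything here is an illustrative assumption, not the paper's exact method: the "Temporal Semantic Surrogate" is modeled as the previous step's cached noise prediction, and a simple linear extrapolation stands in for the curvature information an explicit zigzag would re-evaluate. A real CFG step costs 2 NFE (conditional + unconditional); the toy uses one `eps` call per step to keep the accounting simple, but the structural point is the same — no extra network evaluations beyond the per-step baseline.

```python
class SurrogateSampler:
    """Toy sampler: reuse the previous step's prediction instead of
    re-evaluating the network for the zigzag's extra passes."""

    def __init__(self):
        self.cache = None   # last step's noise prediction (the "surrogate")
        self.nfe = 0        # network evaluations performed

    def eps(self, x):
        self.nfe += 1
        return 0.1 * x      # stand-in for the diffusion network

    def step(self, x):
        e_fresh = self.eps(x)   # exactly one fresh evaluation per step
        e = e_fresh
        if self.cache is not None:
            # Extrapolate from the cached prediction: a zero-cost proxy for
            # the curvature probe that explicit zigzagging would pay NFEs for.
            e = e_fresh + 0.5 * (e_fresh - self.cache)
        self.cache = e_fresh    # refresh the temporal surrogate
        return x - e
```

Running the sampler for N steps performs exactly N network evaluations, i.e. the baseline per-step cost is restored while the extrapolation term still injects information from the trajectory's temporal coherence.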