Guiding a Diffusion Model by Swapping Its Tokens

arXiv cs.CV / 4/10/2026


Key Points

  • The paper proposes Self-Swap Guidance (SSG), a CFG-like inference technique that enables guidance for both conditional and unconditional diffusion generation.
  • SSG works by creating a perturbed prediction via targeted token-latent swap operations, then steering sampling along the direction from the perturbed to the clean prediction, toward higher-fidelity distributions.
  • The method performs fine-grained swaps of the most semantically dissimilar pairs of token latents across spatial or channel dimensions, offering a more constrained perturbation than prior approaches.
  • Experiments on MS-COCO 2014/2017 and ImageNet show SSG improves image fidelity and prompt alignment compared with prior condition-free methods, while also improving robustness across perturbation strengths.
  • The authors claim SSG can be applied as a plug-in to existing diffusion models, requiring minimal integration effort to obtain immediate improvements.
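The guidance recipe in the key points can be sketched in a few lines: find the most semantically dissimilar pairs of token latents (e.g. by cosine similarity), exchange them to form a perturbed input, and extrapolate from the perturbed prediction toward the clean one, CFG-style. The function names, the use of cosine similarity, and the guidance formula `clean + w * (clean - perturbed)` are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def ssg_perturb(tokens, num_swaps=1):
    """Hypothetical sketch of the swap perturbation: exchange the
    most semantically dissimilar pairs of token latents.

    tokens: (N, D) array of token latents (spatial tokens flattened).
    """
    x = tokens.copy()
    # Pairwise cosine similarity between token latents.
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, np.inf)  # ignore self-pairs when searching minima
    for _ in range(num_swaps):
        # Most dissimilar pair = smallest cosine similarity.
        i, j = np.unravel_index(np.argmin(sim), sim.shape)
        x[[i, j]] = x[[j, i]]  # exchange the two latents
        # Mark both tokens as used so later swaps pick fresh pairs.
        sim[i, :] = sim[:, i] = np.inf
        sim[j, :] = sim[:, j] = np.inf
    return x

def ssg_guided_prediction(eps_clean, eps_perturbed, w=1.0):
    """CFG-style extrapolation away from the perturbed prediction
    (assumed form; the paper may weight or schedule this differently)."""
    return eps_clean + w * (eps_clean - eps_perturbed)
```

In use, the diffusion model would be run twice per step, once on the clean latents and once on `ssg_perturb(latents)`, with the two noise predictions combined by `ssg_guided_prediction` before the sampler update, in the same way CFG combines conditional and unconditional predictions.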

Abstract

Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of the most semantically dissimilar token latents in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different setups. Its fine-grained perturbation granularity also improves robustness, reducing side effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.