AI Navigate

Accelerating Diffusion Decoders via Multi-Scale Sampling and One-Step Distillation

arXiv cs.CV / 3/23/2026

📰 News · Models & Research

Key Points

  • The authors propose a two-stage acceleration framework for diffusion decoders used in image tokenization, combining multi-scale sampling and one-step distillation.
  • Multi-scale sampling starts decoding at a coarse resolution and progressively doubles the resolution at each stage, yielding a theoretical O(log n) speedup over standard full-resolution sampling.
  • At each scale, the diffusion decoder is distilled into a single-step denoising model, allowing fast reconstructions with a single forward pass per scale.
  • The combined approach achieves an order-of-magnitude reduction in decoding time with little degradation in output quality, making diffusion-decoder tokenizers practical for real-time and large-scale use and laying groundwork for future work in efficient visual tokenization and downstream generation.
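The coarse-to-fine loop in the points above can be sketched in code. This is a hypothetical illustration, not the paper's implementation: `one_step_denoiser` is a placeholder for the distilled single-step model at each scale, and the nearest-neighbour upsampling is an assumption for the sake of a self-contained example.

```python
def one_step_denoiser(latent, image, res):
    """Placeholder for the distilled single-step denoising model at one scale."""
    return image

def multiscale_decode(latent, base_res=32, target_res=256):
    """Decode coarse-to-fine, doubling the resolution at each stage."""
    res = base_res
    image = [[0.0] * res for _ in range(res)]  # coarse initialization
    stages = 0
    while res < target_res:
        image = one_step_denoiser(latent, image, res)  # one forward pass per scale
        res *= 2
        # nearest-neighbour upsample to the next (doubled) resolution
        image = [[image[i // 2][j // 2] for j in range(res)] for i in range(res)]
        stages += 1
    image = one_step_denoiser(latent, image, res)  # final full-resolution pass
    return image, stages

img, n_stages = multiscale_decode(latent=None)
# Doubling from 32 to 256 takes log2(256/32) = 3 upsampling stages, so the
# number of denoiser calls grows logarithmically in the target resolution.
```

Because each stage is a single forward pass and the stage count grows only logarithmically with resolution, the total work scales as the O(log n) claim suggests.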

Abstract

Image tokenization plays a central role in modern generative modeling by mapping visual inputs into compact representations that serve as an intermediate signal between pixels and generative models. Diffusion-based decoders have recently been adopted in image tokenization to reconstruct images from latent representations with high perceptual fidelity. In contrast to diffusion models used for downstream generation, these decoders are dedicated to faithful reconstruction rather than content generation. However, their iterative sampling process introduces significant latency, making them impractical for real-time or large-scale applications. In this work, we introduce a two-stage acceleration framework to address this inefficiency. First, we propose a multi-scale sampling strategy, where decoding begins at a coarse resolution and progressively refines the output by doubling the resolution at each stage, achieving a theoretical speedup of O(log n) compared to standard full-resolution sampling. Second, we distill the diffusion decoder at each scale into a single-step denoising model, enabling fast and high-quality reconstructions in a single forward pass per scale. Together, these techniques yield an order-of-magnitude reduction in decoding time with little degradation in output quality. Our approach provides a practical pathway toward efficient yet expressive image tokenizers. We hope it serves as a foundation for future work in efficient visual tokenization and downstream generation.
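The second ingredient, one-step distillation, can be illustrated with a toy example. This is a deliberately simplified sketch and not the paper's training recipe: a "teacher" that denoises over many small iterative steps is distilled into a "student" with a single learned step size, so that one student forward pass approximates the full teacher trajectory. All names and hyperparameters here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher(x, target, steps=50):
    """Iterative teacher: many small denoising steps toward the target."""
    for _ in range(steps):
        x = x + 0.1 * (target - x)
    return x

# Student: a single scalar step size `alpha`, trained so that one step
# x + alpha * (target - x) matches the teacher's 50-step output.
alpha = 0.0
for _ in range(200):
    x = rng.normal(size=8)
    target = rng.normal(size=8)
    y_teacher = teacher(x, target)
    y_student = x + alpha * (target - x)  # single forward "step"
    # gradient of the mean squared distillation loss w.r.t. alpha
    grad = np.mean(2 * (y_student - y_teacher) * (target - x))
    alpha -= 0.05 * grad

# After training, alpha converges near 1 - 0.9**50, i.e. one student step
# reproduces what the teacher needed 50 iterative steps to achieve.
```

In the actual framework the student is a full denoising network rather than a scalar, but the principle is the same: collapse the iterative sampling trajectory into one forward pass, which is what makes a single pass per scale sufficient.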