Text Diffusion Model

Why can a smart AI
only write one word at a time?

Until now, mainstream language models have spun out text left to right, one word after another. DiffusionGemma drops that "waiting in line" and shapes the whole passage at once. On your own PC, generation runs up to four times faster. Here we unpack how it works, with diagrams.

AI Navigate Editorial·2026.06.11·6 min read
FIG. Autoregressive: word → word → …, repeating once per output word. Diffusion: noise is shaped into the whole passage at once, in a few steps
01
The Bottleneck

What held back speed was
"waiting in line"

GPT, Claude, Gemma: the mainstream large language models (LLMs) so far have built text using a method called autoregression. The model looks at the words written up to that point, predicts the single next word (strictly speaking, the next token), and repeats. For a 100-word answer, that means 100 turns of waiting in line in principle.

This weakness was easy to overlook. No matter how fast the GPU, the sequential nature itself cannot be skipped: "the next word can't come out until the previous one is settled." Especially when running on your own PC rather than in the cloud, this word-by-word wait was the single biggest drag on perceived speed.
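
The "waiting in line" can be sketched as a simple loop. This is a minimal illustration, not any real model's code: `toy_next_token` is a hypothetical stand-in for one full forward pass, and the point is only that each pass needs the result of the one before it.

```python
from typing import Callable, List, Tuple


def toy_next_token(context: List[str]) -> str:
    """Hypothetical stand-in for one model forward pass: returns one token."""
    return f"w{len(context)}"


def generate_autoregressive(
    prompt: List[str],
    n_new: int,
    next_token: Callable[[List[str]], str] = toy_next_token,
) -> Tuple[List[str], int]:
    """Append n_new tokens one at a time and count the sequential passes."""
    tokens = list(prompt)
    passes = 0
    for _ in range(n_new):
        # Each call sees every token produced so far, so the calls
        # cannot be run side by side: pass k must wait for pass k - 1.
        tokens.append(next_token(tokens))
        passes += 1
    return tokens, passes


tokens, passes = generate_autoregressive(["hello"], n_new=100)
print(passes)  # 100
```

One hundred new tokens cost one hundred passes, however fast each individual pass is.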

Autoregressive (conventional) | Diffusion (DiffusionGemma)
Generates left to right, one word at a time | Shapes the entire frame in parallel, all at once
Needs as many advances as output words | Locks in several words at once in a few steps
The longer the answer, the longer the wait | Up to 4× faster for local generation
Hard to parallelize because it is sequential | Up to 256 tokens in parallel per step

Rather than writing left to right,
let the text emerge all at once from the fog.


02
How It Works

This is how diffusion writes

It's the same idea as the "diffusion models" used in image generation. First the whole space is filled with noise, then it is gradually refined until the text rises into view.

FIG. A noise-filled frame is refined over a few steps (t = 3: noise, t = 1: taking shape, t = 0: finalized in parallel), locking in several words at once
01

Fill with noise

First, the place where output will go is blanketed with meaningless "noise" tokens. Rather than starting from the left edge as autoregression does, the idea is to reserve a "draft frame" for the entire answer up front.

02

Refine all at once (denoise)

At each step it surveys the whole frame and gradually replaces noise with the correct words. The decisive difference from autoregression is that it can lock in several words at once, not one at a time.

03

Rises in a few rounds

Repeat the refinement for a few steps and the text appears, like fog clearing. Because the number of steps barely changes as the output grows, the longer the answer, the greater the speed benefit.
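
The three steps above can be sketched as a toy loop. Everything here is an assumption made for illustration: the `<noise>` placeholder token, the `toy_denoiser` that already "knows" the answer, and the rule of committing the most confident positions first are not taken from DiffusionGemma itself. What the sketch does show is the shape of the process: a full-length frame is reserved up front, and each step fills in several positions at once.

```python
import math
import random
from typing import List, Tuple

NOISE = "<noise>"  # hypothetical placeholder for a not-yet-decided position


def toy_denoiser(
    frame: List[str], target: List[str], rng: random.Random
) -> List[Tuple[str, float]]:
    """Stand-in for the model: propose (word, confidence) for every position."""
    return [(target[i], rng.random()) for i in range(len(frame))]


def generate_diffusion(
    target: List[str], steps: int, seed: int = 0
) -> Tuple[List[str], List[List[str]]]:
    rng = random.Random(seed)
    frame = [NOISE] * len(target)                # 1. fill the whole frame with noise
    per_step = math.ceil(len(target) / steps)    # positions to lock in per step
    history = []
    for _ in range(steps):
        proposals = toy_denoiser(frame, target, rng)   # 2. survey the whole frame
        open_pos = [i for i, tok in enumerate(frame) if tok == NOISE]
        open_pos.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in open_pos[:per_step]:                  # lock in several words at once
            frame[i] = proposals[i][0]
        history.append(list(frame))
    return frame, history                        # 3. text has emerged after a few rounds


target = "the text rises out of the fog at once".split()
final, history = generate_diffusion(target, steps=3)
for t, snapshot in enumerate(history, start=1):
    print(t, " ".join(snapshot))
```

Nine words are settled in three rounds rather than nine, and the positions fill in out of order rather than left to right.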

03
Under the Hood

Why is it 4× faster
on your own PC?

Speed comes from two things: an MoE that "runs only part of the brain," and a diffusion head that "outputs in bulk."

FIG. Only part of the huge model runs (sparse computation: of 26B in experts, only 3.8B runs per step), and the diffusion head finalizes the output in parallel, denoising up to 256 tokens at once
26B: total parameters (MoE)
3.8B: parameters activated per step
Up to 4×: speedup for local generation

The key is MoE (Mixture of Experts). The model carries 26B parameters in total, but only about 3.8B of them run in a single pass. Because it wakes only the "experts" needed for the input at hand, both compute and power draw stay down. That is why it can realistically run on a GPU at hand rather than in a data center.
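
To make "runs only part of the brain" concrete, here is a minimal sketch of top-k expert routing, a common way to build an MoE layer. The eight toy experts, the router scores, and `top_k=2` are made-up values for illustration; the article does not state how many experts DiffusionGemma has or how many it selects. Only the 26B and 3.8B figures come from the text.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: float,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    top_k: int,
) -> Tuple[float, List[int]]:
    """Run only the top_k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    # Experts outside `chosen` are never called: that is the saved compute.
    y = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return y, chosen


# Eight hypothetical experts; expert i simply multiplies its input by i + 1.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # made-up router scores

y, chosen = moe_forward(3.0, logits, experts, top_k=2)
print(chosen, y)

# Share of the model that is active per pass, using the article's figures.
active_fraction = 3.8 / 26
print(round(active_fraction * 100, 1))  # 14.6
```

With the article's numbers, roughly 15% of the parameters do the work in any one pass, while the rest sit idle until the router calls on them.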

On top of that sits a diffusion head. Where autoregression needs N advances to produce N words, diffusion denoises up to 256 tokens in bulk in a single step. In other words, the number of repetitions is decoupled from the length of the output, which is why the speedup pays off most on long passages.
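
A quick back-of-the-envelope count shows what that decoupling means. The 256-token limit is from the article; `steps_per_frame=4` is a placeholder for "a few steps," and handling longer outputs as several 256-token frames in sequence is an assumption of this sketch, not a documented behavior. The numbers count model passes only, not wall-clock time, so they are not themselves the "up to 4×" figure: a pass that processes a whole frame may well cost more than a pass that emits one token.

```python
import math

MAX_PARALLEL_TOKENS = 256  # from the article: up to 256 tokens per step


def autoregressive_passes(n_tokens: int) -> int:
    """One sequential pass per output token."""
    return n_tokens


def diffusion_passes(n_tokens: int, steps_per_frame: int) -> int:
    """A few passes per frame of up to MAX_PARALLEL_TOKENS tokens (assumed)."""
    frames = math.ceil(n_tokens / MAX_PARALLEL_TOKENS)
    return frames * steps_per_frame


for n in (32, 256, 1024):
    print(n, autoregressive_passes(n), diffusion_passes(n, steps_per_frame=4))
```

Under these assumptions a 32-token reply and a 256-token reply need the same number of diffusion passes, while the autoregressive count grows in lockstep with length.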

04
In Practice

Try it right now, on your own machine

Released as open weights under Apache 2.0, with support in major tools from day one.

Run on a local GPU

On a PC with an RTX card, you can generate without worrying about API billing or token limits. NVIDIA has already optimized it for RTX / DGX Spark.

Slot into an existing pipeline

Day-zero support in Hugging Face Transformers, vLLM, and Unsloth. You can drop it straight into the setup you're already running.

Edge and on-prem inference

In-house data that can't be sent to the cloud stays on your own machine. It also suits edge inference, where there is no network round trip to wait on.


05
Frontier

This will be a turning point

Attempts to use diffusion for text generation had so far stayed at the experimental stage at specialist startups such as Inception, the maker of Mercury. This time, a major frontier lab has for the first time implemented and released a text diffusion architecture at general-purpose LLM scale, and that marks an industry turning point. The very premise that "speed is a trade-off against quality" is beginning to waver.

Of course, it does not fully match the quality of the most advanced models. The realistic view is that another option has appeared for balancing speed and quality. For those satisfied with a cloud API the difference may be a rounding error, but for anyone running models locally or eyeing edge inference, it is a meaningful step forward.
