Why can a smart AI only write one word at a time?
Until now, language models have produced text left to right, one word after another. DiffusionGemma does away with that "waiting in line" and shapes the whole passage at once. On your own PC, generation runs up to four times faster. Here we unpack exactly how it works, with diagrams.
What held back speed was "waiting in line"
GPT, Claude, Gemma: the mainstream large language models (LLMs) so far have all built text using a method called autoregression. The model looks at the words written so far, predicts the single next word (strictly speaking, the next token), and repeats. In principle, a 100-word answer means 100 turns of waiting in line.
This weakness was easy to overlook. No matter how fast the GPU, you cannot skip the sequential dependency itself: the next word can't come out until the previous one is settled. Especially when running on your own PC rather than in the cloud, this word-by-word wait was the biggest drag on perceived speed.
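To make the "waiting in line" concrete, here is a minimal sketch of an autoregressive loop. The function `toy_next_token` is a hypothetical stand-in for a real model, invented purely for illustration; the point is only that each call needs the result of the previous one, so the calls cannot overlap.

```python
from typing import Callable, List, Tuple


def generate_autoregressive(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    n_new: int,
) -> Tuple[List[str], int]:
    """Append n_new tokens one at a time; return (tokens, number of model calls)."""
    tokens = list(prompt)
    calls = 0
    for _ in range(n_new):
        # Each call sees everything generated so far, so it has to wait
        # for the previous call to finish.
        tokens.append(next_token(tokens))
        calls += 1
    return tokens, calls


def toy_next_token(context: List[str]) -> str:
    """Hypothetical stand-in for a model: names the token after its position."""
    return f"w{len(context)}"


tokens, calls = generate_autoregressive(toy_next_token, ["<start>"], 100)
print(calls)  # 100 sequential calls for 100 new tokens
```

However fast a single call gets, the loop still runs once per output token.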
| Autoregressive (conventional) | Diffusion (DiffusionGemma) |
|---|---|
| Generates left to right, one word at a time | Shapes the entire frame in parallel, all at once |
| Needs as many advances as output words | Locks in several words at once in a few steps |
| The longer the answer, the longer the wait | Up to 4× faster for local generation |
| Hard to parallelize because it is sequential | Up to 256 tokens in parallel per step |
Rather than writing left to right, let the text emerge all at once from the fog.
This is how diffusion writes
It's the same idea as the "diffusion models" used in image generation. First the whole space is filled with noise, then it is gradually refined until the text rises into view.
Fill with noise
First, the place where output will go is blanketed with meaningless "noise" tokens. Rather than starting from the left edge as autoregression does, the idea is to reserve a "draft frame" for the entire answer up front.
Refine all at once (denoise)
At each step it surveys the whole frame and gradually replaces noise with the correct words. The decisive difference from autoregression is that it can lock in several words at once, not one at a time.
Rises in a few rounds
Repeat the refining for a few steps and the text appears, like fog clearing. Because the number of repetitions barely changes as the output grows, the speed benefit increases with the length of the answer.
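The three stages above can be sketched as a toy loop. This is an illustration under loud assumptions, not DiffusionGemma's actual sampler: the "model" here already knows the answer (`target`), the `<noise>` placeholder is invented, and random numbers stand in for the confidence a real model would assign to each position.

```python
import random
from math import ceil
from typing import List

NOISE = "<noise>"  # hypothetical placeholder token, not a real vocabulary entry


def denoise(target: List[str], steps: int, seed: int = 0) -> List[List[str]]:
    """Return the frame after each step, starting from pure noise.

    Toy stand-in for a model: it already knows `target` and uses random
    numbers as "confidence" to decide which positions to lock in first.
    """
    rng = random.Random(seed)
    frame = [NOISE] * len(target)          # stage 1: fill the whole frame with noise
    history = [list(frame)]
    for step in range(steps):
        open_positions = [i for i, tok in enumerate(frame) if tok == NOISE]
        if not open_positions:
            break
        confidence = {i: rng.random() for i in open_positions}
        ranked = sorted(open_positions, key=confidence.get, reverse=True)
        # Stage 2: lock in several positions at once, spread evenly
        # over the steps that remain.
        k = ceil(len(open_positions) / (steps - step))
        for i in ranked[:k]:
            frame[i] = target[i]
        history.append(list(frame))        # stage 3: the text gradually appears
    return history


target = "the text rises out of the fog all at once".split()
history = denoise(target, steps=3)
for frame in history:
    print(" ".join(frame))
```

With 10 words and 3 steps, the frame goes from 10 noise tokens to 6, then 3, then none. The positions are filled in no particular left-to-right order, which is the contrast with autoregression.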
Why is it 4× faster on your own PC?
Speed comes from two things: an MoE that "runs only part of the brain," and a diffusion head that "outputs in bulk."
The key is MoE (Mixture of Experts). The model carries 26B parameters in total, but only about 3.8B of them are active in any single pass. Because it wakes only the "experts" needed for the question, both power draw and compute stay down. That is exactly why it can realistically run on a GPU you have at hand rather than in a data center.
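As a rough illustration of "running only part of the brain," the sketch below routes an input to the top-scoring experts and skips the rest. The expert count (8), `top_k` (2), and the router scores are all hypothetical values chosen for the example; the article gives only the 26B total and 3.8B active figures. Real MoE layers typically apply a softmax to learned router logits, which is simplified here to normalizing fixed scores.

```python
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[float], float]


def moe_forward(
    x: float,
    experts: Sequence[Expert],
    router_scores: Sequence[float],
    top_k: int,
) -> Tuple[float, List[int]]:
    """Run only the top_k highest-scoring experts and blend their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(router_scores[i] for i in chosen)
    # Experts outside `chosen` are never called, so they cost no compute.
    output = sum(router_scores[i] / total * experts[i](x) for i in chosen)
    return output, chosen


# Eight toy experts: expert i simply multiplies its input by i + 1.
experts = [lambda x, m=m: m * x for m in range(1, 9)]
router_scores = [0.05, 0.40, 0.05, 0.10, 0.30, 0.04, 0.03, 0.03]

output, chosen = moe_forward(1.0, experts, router_scores, top_k=2)
print(chosen)  # [1, 4]

# Share of parameters active per pass, using the article's figures.
active_fraction = 3.8 / 26
print(f"{active_fraction:.1%}")  # 14.6%
```

Only 2 of the 8 toy experts run for this input. By the article's numbers, about 15% of the parameters do the work in any one pass.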
On top of that sits a diffusion head. Where autoregression meant "N advances to produce N words," diffusion denoises up to 256 tokens in bulk in a single step. In other words, it can decouple the number of repetitions from the length of the output. That is the logic behind a speedup that pays off most on long passages.
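A back-of-the-envelope count shows what this decoupling means. The 256-token figure comes from the text above. The assumptions are that outputs longer than 256 tokens are handled in consecutive 256-token blocks, and that each block takes `steps_per_block` refinement steps; the value 8 is a hypothetical choice, since the article says only "a few steps."

```python
from math import ceil

TOKENS_PER_STEP = 256  # from the article: up to 256 tokens denoised per step


def autoregressive_passes(n_tokens: int) -> int:
    """One model pass per generated token."""
    return n_tokens


def diffusion_passes(n_tokens: int, steps_per_block: int) -> int:
    """Assumed block-wise scheme: each 256-token block takes a fixed number of steps."""
    blocks = ceil(n_tokens / TOKENS_PER_STEP)
    return blocks * steps_per_block


for n in (64, 256, 1024):
    print(n, autoregressive_passes(n), diffusion_passes(n, steps_per_block=8))
# 64 64 8
# 256 256 8
# 1024 1024 32
```

Pass counts are not wall-clock time: each diffusion pass processes many more tokens than an autoregressive one, so the ratio of passes overstates the real speedup. The sketch shows only why the gap widens as the output gets longer.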
Try it right now, on your own machine
Released as open weights under Apache 2.0, with support in major tools from day one.
Run on a local GPU
On a PC with an RTX card, you can generate without worrying about API billing or token limits. NVIDIA has already optimized it for RTX / DGX Spark.
Slot into an existing pipeline
Day-zero support in Hugging Face Transformers, vLLM, and Unsloth. You can drop it straight into the setup you're already running.
Edge and on-prem inference
Even in-house data that can't be sent to the cloud stays on your own machine. It also suits edge inference, where there is no network round trip to wait for.
This will be a turning point
Until now, attempts to use diffusion for text generation had stayed at the experimental stage at specialist startups such as Inception, the maker of Mercury. This time a major frontier lab has implemented and released a text diffusion architecture at general-purpose LLM scale for the first time, and that marks an industry turning point. The very premise that "speed is a trade-off against quality" is beginning to waver.
Of course, it does not fully match the quality of the most advanced models. The realistic view is that another option has appeared for balancing speed and quality. For those satisfied with a cloud API the difference may be a rounding error, but for people running models locally and anyone eyeing edge inference, it is a solid new option.