Generate all tokens in parallel by learning to denoise masked sequences. Diffusion LLMs trade autoregressive guarantees for throughput — a natural fit for GPU hardware.
Image diffusion models corrupt an image with Gaussian noise, then learn to reverse the process. For discrete tokens, Gaussian noise doesn't apply — instead we use masking: the forward process randomly replaces tokens with a [MASK] token.
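The forward (corruption) process can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation; the `mask_tokens` helper and the `"[MASK]"` string are hypothetical names, and the noise level `t` plays the role of the diffusion timestep:

```python
import random

MASK = "[MASK]"  # illustrative mask token

def mask_tokens(tokens, t, rng=random):
    """Forward process: independently replace each token with [MASK]
    with probability t, where t in [0, 1] is the noise level.
    At t=1 the sequence is fully masked; at t=0 it is untouched."""
    return [MASK if rng.random() < t else tok for tok in tokens]

# Partially corrupt a sequence at noise level t=0.5:
random.seed(0)
corrupted = mask_tokens(["the", "cat", "sat", "on", "the", "mat"], t=0.5)
```

Training pairs a corrupted sequence like `corrupted` with the original tokens, so the model learns to predict what was masked.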
The diffusion model learns the reverse: given a (partially) masked sequence, predict all masked tokens simultaneously. At inference time, we start from fully masked and iteratively denoise, updating ALL token positions in parallel at each step.
(All tokens processed in parallel each step — contrast with autoregressive which generates one token per step)
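The inference loop above can be sketched as follows. This is a simplified sampler under stated assumptions: `model` is any callable returning a prediction for every position in one forward pass, and the linear unmasking schedule is illustrative (real samplers often keep the most confident predictions rather than a random subset):

```python
import random

MASK = "[MASK]"  # illustrative mask token

def denoise(model, length, T, rng=random):
    """Reverse process sketch: start fully masked, run T denoising steps.
    Each step the model predicts every masked position in parallel; we
    finalise a growing fraction of positions and leave the rest masked
    for the next step."""
    seq = [MASK] * length
    done = 0  # positions finalised so far
    for step in range(1, T + 1):
        preds = model(seq)                  # one forward pass, all L positions
        target = round(length * step / T)   # linear unmasking schedule
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        for i in rng.sample(masked, target - done):
            seq[i] = preds[i]
        done = target
    return seq  # fully unmasked after T steps
```

Note that the number of forward passes is `T`, independent of the sequence length `length`.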
Autoregressive: generate L tokens = L sequential steps. Each step = one full model forward pass on growing context. Diffusion: generate L tokens in T denoising steps (T << L). Each step = one full model forward pass on all L tokens in parallel.
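The step-count asymmetry is simple arithmetic; the helper below just makes it concrete (T = 20 is an illustrative value within the typical 10–50 range):

```python
def forward_passes(L, mode, T=20):
    """Full-model forward passes needed to generate L tokens.
    Autoregressive: one pass per token, L sequential passes.
    Diffusion: T passes total, independent of L."""
    return L if mode == "autoregressive" else T

# For a 1024-token completion:
forward_passes(1024, "autoregressive")  # 1024 sequential passes
forward_passes(1024, "diffusion")       # 20 passes
```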
Autoregressive generation at batch=1 is heavily memory-bandwidth bound. Each token generation requires loading all model weights but performs very few multiply-accumulate operations on them. Diffusion forward passes process all L tokens simultaneously — high compute utilisation.
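A back-of-envelope arithmetic-intensity calculation shows why. Weights are loaded once per forward pass regardless of how many tokens that pass covers, and each parameter contributes roughly 2 FLOPs (multiply + add) per token. The function and numbers below are illustrative:

```python
def arithmetic_intensity(n_params, n_tokens, bytes_per_param=2):
    """FLOPs performed per byte of weights loaded in one forward pass.
    ~2 FLOPs per parameter per token; weights loaded once per pass."""
    flops = 2 * n_params * n_tokens
    bytes_loaded = n_params * bytes_per_param
    return flops / bytes_loaded

# 7B-parameter model in fp16 (bytes_per_param=2), illustrative:
arithmetic_intensity(7e9, n_tokens=1)    # -> 1.0 FLOP/byte: bandwidth-bound
arithmetic_intensity(7e9, n_tokens=512)  # -> 512.0 FLOP/byte: compute-bound
```

At 1 FLOP per byte the GPU's compute units sit idle waiting on memory; processing hundreds of tokens per pass, as a diffusion step does, pushes the ratio well past the point where compute becomes the bottleneck.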
Diffusion LLMs are not strictly better than autoregressive — they trade different things.
| Aspect | Autoregressive | Diffusion LLM |
|---|---|---|
| Generation direction | Left-to-right, one token | All positions, parallel |
| Steps for L tokens | L forward passes | T << L forward passes (typ. 10–50) |
| GPU compute utilisation | Low (~15% at batch=1) | High (~70–80%) |
| Output quality | State of the art | Competitive, some coherence gaps |
| Streaming output | Natural (token-by-token) | Awkward (needs all T steps first) |
| Controllability | High (exact per-token logprobs) | More complex |
| Long sequence throughput | Slow (O(L) serial steps) | Fast (O(T) parallel steps) |