Lesson 05

Diffusion Language Models

Generate all tokens in parallel by learning to denoise masked sequences. Diffusion LLMs trade autoregressive guarantees for throughput — a natural fit for GPU hardware.

From image to text diffusion

Image diffusion models corrupt an image with Gaussian noise, then learn to reverse the process. For discrete tokens, Gaussian noise doesn't apply — instead we use masking: the forward process randomly replaces tokens with a [MASK] token.
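The forward (noising) process can be sketched in a few lines. This is a minimal illustration, not any specific model's implementation: each token is independently replaced with `[MASK]` with probability equal to the noise level.

```python
import random

def mask_tokens(tokens, mask_rate, mask_token="[MASK]"):
    """Forward process: independently replace each token with the
    mask token with probability mask_rate (the noise level)."""
    return [mask_token if random.random() < mask_rate else tok
            for tok in tokens]

sentence = "language models learn from vast training data".split()
noised = mask_tokens(sentence, mask_rate=0.5)
# Roughly half the tokens are replaced (varies run to run).
```

At `mask_rate=0` the sequence is untouched; at `mask_rate=1` it is fully masked, which is where generation starts.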

(Figure: forward process — each word above, a synonym the model may regenerate below)

The reverse process — parallel denoising

The diffusion model learns the reverse: given a partially masked sequence, predict all masked tokens simultaneously. At inference time, we start from a fully masked sequence and iteratively denoise, updating all token positions in parallel at each step.

Notice the synonyms. The reverse process doesn't recover the exact original — it samples a plausible completion. Each token position had candidates: "language" → text, "models" → networks, "learn" → train, "vast" → huge, "training" → web, "data" → datasets. Diffusion generates, not copies.
Key insight: all token positions are updated in parallel at every step — in contrast with autoregressive generation, which produces one token per step.
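The reverse process can be sketched as a loop. Everything here is a toy stand-in under stated assumptions: the `CANDIDATES` table replaces a trained denoiser's per-position softmax, and the "commit a subset" rule is a simplification of confidence-based unmasking schedules.

```python
import random

# Toy candidate sets standing in for a trained denoiser's predictions
# (a real model samples from a transformer's per-position distribution).
CANDIDATES = {
    0: ["language", "text"],  1: ["models", "networks"],
    2: ["learn", "train"],    3: ["from"],
    4: ["vast", "huge"],      5: ["training", "web"],
    6: ["data", "datasets"],
}

def denoise(num_tokens, steps):
    """Reverse process sketch: every step predicts ALL masked positions
    in parallel, then commits only a subset, leaving the rest masked."""
    seq = ["[MASK]"] * num_tokens
    masked = list(range(num_tokens))
    per_step = -(-num_tokens // steps)  # ceil: positions finalised per step
    for _ in range(steps):
        # One "forward pass": predictions for every masked position at once.
        preds = {i: random.choice(CANDIDATES[i]) for i in masked}
        # Commit a subset (real models keep the highest-confidence tokens).
        for i in random.sample(masked, min(per_step, len(masked))):
            seq[i] = preds[i]
            masked.remove(i)
    return seq

result = denoise(num_tokens=7, steps=3)  # 7 tokens in 3 parallel steps
```

Note that 7 tokens are produced in 3 forward passes rather than 7 — the source of the throughput advantage discussed below.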

Parallel generation timeline

Autoregressive: generate L tokens = L sequential steps. Each step = one full model forward pass on growing context. Diffusion: generate L tokens in T denoising steps (T << L). Each step = one full model forward pass on all L tokens in parallel.
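The latency arithmetic is easy to make concrete. The numbers below are illustrative assumptions, not measurements: at batch=1 a forward pass is memory-bound, so its time is set by loading the weights and is roughly the same whether it computes one token (autoregressive) or all sixteen (diffusion).

```python
# Back-of-envelope latency for the L = 16 comparison.
L = 16           # tokens to generate
T = 4            # assumed number of denoising steps
pass_ms = 20.0   # assumed per-pass time (hypothetical, memory-bound)

ar_latency = L * pass_ms    # 16 sequential passes -> 320 ms
diff_latency = T * pass_ms  # 4 sequential passes  -> 80 ms
```

Under these assumptions diffusion is 4x faster end to end, and the gap widens as L grows while T stays fixed.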

(Figure: wall-clock time comparison, L = 16 tokens)

GPU utilisation

Autoregressive generation at batch=1 is heavily memory-bandwidth bound. Each token generation requires loading all model weights but performs very few multiply-accumulate operations on them. Diffusion forward passes process all L tokens simultaneously — high compute utilisation.

(Figure: hardware utilisation — AR generation vs diffusion step)

Each diffusion denoising step is arithmetically similar to a training forward pass — full model, full sequence, all tokens. This maps directly to what GPUs are designed for: dense matrix multiplications on large batches.
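The utilisation gap can be quantified with an arithmetic-intensity sketch. The model size and precision below are assumptions for illustration: each forward pass reads every weight once, while the multiply-accumulates scale with the number of tokens computed.

```python
# Arithmetic-intensity sketch (illustrative, not measured).
params = 7e9          # assumed 7B-parameter model
bytes_per_weight = 2  # fp16

def intensity(tokens):
    """FLOPs per byte of weight traffic for one forward pass."""
    flops = 2 * params * tokens           # ~2 FLOPs per weight per token
    bytes_moved = params * bytes_per_weight
    return flops / bytes_moved

ar_step = intensity(1)     # AR decode: 1 new token per pass -> 1 FLOP/byte
diff_step = intensity(16)  # diffusion: all 16 tokens       -> 16 FLOP/byte
```

Modern GPUs need hundreds of FLOPs per byte to saturate their compute units, so the autoregressive step sits deep in memory-bound territory; the diffusion step moves meaningfully toward the compute-bound regime, and further still with longer sequences.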

Tradeoffs and current models

Diffusion LLMs are not strictly better than autoregressive — they trade different things.

| Aspect | Autoregressive | Diffusion LLM |
|---|---|---|
| Generation direction | Left-to-right, one token at a time | All positions in parallel |
| Steps for L tokens | L forward passes | T << L forward passes (typically 10–50) |
| GPU compute utilisation | Low (~15% at batch=1) | High (~70–80%) |
| Output quality | State of the art | Competitive, some coherence gaps |
| Streaming output | Natural (token-by-token) | Requires all steps to finish first |
| Controllability | High (exact logprobs) | More complex |
| Long-sequence throughput | Slow (O(L) serial steps) | Fast (O(T) parallel steps) |

Key models include MDLM (Masked Diffusion Language Model, 2024), SEDD (Score Entropy Discrete Diffusion, 2023), and Mercury (Inception Labs, 2025, the first commercial diffusion LLM); together they show the approach is viable at scale.
The future likely combines both: diffusion for high-throughput bulk generation, autoregressive for streaming and precise control. The hardware trend — GPUs optimised for dense parallel compute — strongly favours diffusion's workload profile.