Lesson 04

Limits of Serial Decoding

Autoregressive generation is fundamentally sequential — each token depends on all previous outputs. This seriality bottlenecks modern hardware that was built for parallel computation.

Training vs inference parallelism

During training, all target tokens are known in advance. The entire sequence can be processed in one forward pass — highly parallel. During inference, each token must be generated before the next one can start — strictly serial.

Parallelism comparison

The memory bandwidth wall

Each decoding step, the model must load ALL of its weights from GPU memory (HBM) into the compute units. For a 70B parameter model at float16, that's 140 GB read per step — whether that step produces one token (batch size 1) or a thousand (batch size 1000).
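Dividing that weight traffic by memory bandwidth gives a hard per-step latency floor. A back-of-envelope sketch; the 3.35 TB/s HBM figure is an assumption (roughly H100-class), not from the lesson:

```python
# Back-of-envelope: per-step weight traffic and the latency floor it implies.
PARAMS = 70e9                 # 70B parameters
BYTES_PER_PARAM = 2           # float16
HBM_BANDWIDTH = 3.35e12       # bytes/s — assumed H100-class figure

weight_bytes = PARAMS * BYTES_PER_PARAM        # 140 GB per decoding step
min_step_time = weight_bytes / HBM_BANDWIDTH   # bandwidth-bound lower bound

print(f"weights per step: {weight_bytes / 1e9:.0f} GB")
print(f"latency floor:    {min_step_time * 1e3:.1f} ms/step "
      f"-> at most {1 / min_step_time:.0f} tokens/s per sequence")
```

Even with zero compute time, a single sequence cannot decode faster than this bandwidth-bound ceiling.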

Arithmetic intensity = FLOPs / bytes read. A transformer forward pass during training is compute-bound (many tokens processed per weight load = high intensity). Incremental token generation is memory-bound (one token per weight load = low intensity).
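This arithmetic can be made concrete: with roughly 2 FLOPs per parameter per token and float16 weights, intensity works out to about the number of tokens processed per weight load, which you can compare against a GPU's ridge point. The ridge-point value below is an assumption (~989 TFLOP/s fp16 over 3.35 TB/s, H100-class), not a figure from the lesson:

```python
# Arithmetic intensity of a dense pass: FLOPs scale with the number of tokens,
# but the weights are read from HBM once regardless.
RIDGE_POINT = 295.0  # FLOPs/byte — assumed: ~989e12 FLOP/s fp16 / 3.35e12 B/s

def arithmetic_intensity(tokens: int, params: float = 70e9,
                         bytes_per_param: int = 2) -> float:
    flops = 2 * params * tokens        # ~2 FLOPs per parameter per token
    bytes_read = params * bytes_per_param
    return flops / bytes_read          # equals `tokens` for fp16, 2 FLOPs/param

for tokens in (1, 4096):
    ai = arithmetic_intensity(tokens)
    regime = "compute-bound" if ai > RIDGE_POINT else "memory-bound"
    print(f"{tokens:>5} tokens -> intensity {ai:7.1f} FLOPs/byte ({regime})")
```

One token per step sits far below the ridge point (memory-bound); a training pass over thousands of tokens sits far above it (compute-bound).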

Roofline model — arithmetic intensity vs throughput

Latency vs throughput

Latency is how long a single user waits for their response. Throughput is how many tokens per second the server produces across all users. These are in direct tension: small batches give low latency, large batches give high throughput.

The batch size controls how many users are processed in a single GPU forward pass. Batch=1 means only one user's token is generated per step — all other users queue. Batch=4 means all four users receive their next token in the same step, at almost no extra latency cost, because the step time is dominated by the same fixed weight load.
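A toy cost model makes the tradeoff numeric under the memory-bound assumption: each step pays a fixed weight-load cost plus a small marginal cost per sequence. The 40 ms and 0.5 ms figures below are illustrative assumptions, not measurements:

```python
# Toy latency/throughput model for batched decoding (memory-bound regime).
WEIGHT_LOAD_MS = 40.0   # fixed cost per step (assumed, bandwidth-bound)
PER_SEQ_MS = 0.5        # marginal compute per extra sequence (assumed)

def step_time_ms(batch: int) -> float:
    return WEIGHT_LOAD_MS + PER_SEQ_MS * batch

for batch in (1, 4, 32, 256):
    t = step_time_ms(batch)
    print(f"batch={batch:>3}  latency={t:6.1f} ms/token  "
          f"throughput={batch / t * 1000:8.0f} tokens/s")
```

Latency per token creeps up slowly while aggregate throughput grows almost linearly with batch size — exactly the tension the scheduler has to manage.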

Request scheduling — Gantt timeline

Speculative decoding

Speculative decoding uses a small, fast 'draft' model to generate N candidate tokens cheaply; the large target model then verifies all N in a single parallel forward pass. If the target model agrees, you get N tokens for the price of roughly one large-model step.

Speculative decoding timeline

In practice, speculative decoding yields roughly a 2–3× speedup over running the target model alone. The acceptance rate depends on how closely the draft model's distribution matches the target's. It works because verifying N tokens in one parallel pass costs about as much as generating a single token serially.
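The propose-then-verify loop can be sketched as follows. This is a minimal toy with greedy verification; draft() and target() are random stand-ins, not real models, and production systems use rejection sampling so the output distribution exactly matches the target's:

```python
import random

# Toy speculative decoding step: draft proposes, target verifies in one pass.
random.seed(0)
VOCAB = list(range(8))

def draft(ctx):
    # Hypothetical cheap draft model: just a random token here.
    return random.choice(VOCAB)

def target(ctx):
    # Hypothetical large target model: also random in this toy.
    return random.choice(VOCAB)

def speculative_step(ctx, n_draft=4):
    # 1) Draft model proposes n_draft tokens serially (cheap).
    proposed = []
    for _ in range(n_draft):
        proposed.append(draft(ctx + proposed))
    # 2) Target model checks every position in one "parallel" verify pass.
    accepted = []
    for tok in proposed:
        t_tok = target(ctx + accepted)
        if tok == t_tok:
            accepted.append(tok)      # draft agreed: token accepted
        else:
            accepted.append(t_tok)    # mismatch: keep target's token, stop
            return accepted
    # 3) All drafts accepted: the verify pass yields one bonus token for free.
    accepted.append(target(ctx + accepted))
    return accepted
```

Every call emits at least one token per large-model step and up to n_draft + 1, which is where the speedup comes from when draft and target agree often.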

Why this matters for the future

These constraints — serial decoding, memory bandwidth bottleneck, latency-throughput tension — motivate entirely new model architectures that generate all tokens in parallel rather than one at a time. That's where diffusion language models come in.

Each limitation described here is a concrete engineering pressure: hardware designed for parallel dense matrix multiplications sits underutilised during autoregressive generation. The question isn't whether alternatives will emerge, but which tradeoffs they make.