Autoregressive generation is fundamentally sequential: each token depends on all previous outputs. That serial dependency bottlenecks modern hardware, which is built for parallel computation.
During training, all target tokens are known in advance. The entire sequence can be processed in one forward pass — highly parallel. During inference, each token must be generated before the next one can start — strictly serial.
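The asymmetry can be made concrete with a toy sketch. The "model" below is a hypothetical stand-in (a deterministic function of the prefix), not a real network; the point is the dependency structure, not the math.

```python
def toy_model(tokens):
    # Hypothetical stand-in for a forward pass: predicts the next token
    # from the prefix. Any deterministic function of the prefix will do.
    return (sum(tokens) + 1) % 100

def training_step(target_sequence):
    """Training: every target is known up front, so each position's
    prediction depends only on a known prefix. The calls below are
    independent of one another and could all run in parallel."""
    return [toy_model(target_sequence[:i]) for i in range(1, len(target_sequence))]

def generate(prompt, n_new):
    """Inference: each new token is input to the next prediction,
    forcing a strictly serial loop of n_new model calls."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(toy_model(tokens))  # must finish before the next call
    return tokens
```

In `training_step`, nothing produced at position i is needed at position i+1; in `generate`, everything is.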
Each decoding step, the model must load ALL model weights from GPU memory (HBM) into the compute units. For a 70B parameter model at float16, that's 140 GB read per step — the same cost whether that step produces one user's token (batch size 1) or a thousand users' tokens (batch size 1000).
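A back-of-envelope calculation shows how hard this ceiling is. The bandwidth figure below is an illustrative assumption (roughly a modern HBM3 part, ~3.35 TB/s); the weight-traffic number follows directly from the model size.

```python
# Weight traffic per decode step, and the bandwidth-implied ceiling on
# single-stream decode speed. Bandwidth is an illustrative assumption.
params = 70e9            # 70B parameters
bytes_per_param = 2      # float16
hbm_bandwidth = 3.35e12  # bytes/second (assumed, ~HBM3-class)

bytes_per_step = params * bytes_per_param         # 140 GB of weight reads
max_tokens_per_s = hbm_bandwidth / bytes_per_step # ceiling at batch size 1

print(f"{bytes_per_step / 1e9:.0f} GB per step, "
      f"~{max_tokens_per_s:.0f} tokens/s upper bound at batch 1")
```

Even with nothing else competing for bandwidth, a batch-1 decode of this model cannot exceed a few dozen tokens per second on such hardware.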
Arithmetic intensity = FLOPs / bytes read. Training is compute-bound: many tokens are processed per weight load, so intensity is high. Token-by-token generation is memory-bound: one token per weight load, so intensity is low.
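For the dense matmuls that dominate a transformer, a rough per-parameter accounting (an assumption, ignoring activations and KV-cache traffic) makes the gap explicit: each parameter costs about 2 FLOPs per token it processes, but only needs to be read once per pass, regardless of how many tokens share that read.

```python
def arithmetic_intensity(tokens_per_weight_load,
                         bytes_per_param=2,       # float16
                         flops_per_param_token=2):
    """FLOPs per byte of weight traffic, per parameter. The parameter
    count cancels, so only tokens-per-weight-load matters."""
    flops = flops_per_param_token * tokens_per_weight_load
    bytes_moved = bytes_per_param
    return flops / bytes_moved

print(arithmetic_intensity(1))     # batch-1 decode: 1.0 FLOP/byte, memory-bound
print(arithmetic_intensity(4096))  # training over 4096 tokens: 4096.0, compute-bound
```

GPUs need intensities in the hundreds of FLOPs per byte to saturate their compute units, which training easily clears and batch-1 decoding never approaches.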
Latency is how long a single user waits for their response. Throughput is how many tokens per second the server produces across all users. These are in direct tension: small batches give low latency, large batches give high throughput.
The batch size controls how many users are processed in a single GPU forward pass. Batch=1 means only one user's token is generated per step — all other users queue. Batch=4 means all four users receive their next token in the same step, at almost no extra latency cost: because the step is memory-bound, its time is dominated by the weight load, which is shared across the whole batch.
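A toy cost model illustrates why modest batching is nearly free. Both constants are illustrative assumptions: a fixed per-step cost standing in for the weight load, plus a small marginal compute cost per batched user.

```python
# Toy latency/throughput model for memory-bound decoding.
WEIGHT_LOAD_MS = 40.0  # fixed cost per step, shared by the batch (assumed)
PER_USER_MS = 0.5      # marginal compute per batched user (assumed)

def step_time_ms(batch):
    """Per-token latency each user in the batch experiences."""
    return WEIGHT_LOAD_MS + PER_USER_MS * batch

def throughput_tok_s(batch):
    """Tokens per second produced across all users."""
    return batch * 1000.0 / step_time_ms(batch)

for b in (1, 4, 32):
    print(f"batch={b:3d}  latency/token={step_time_ms(b):5.1f} ms  "
          f"throughput={throughput_tok_s(b):6.1f} tok/s")
```

Under these assumptions, going from batch 1 to batch 4 raises per-user latency by under 4% while nearly quadrupling server throughput; the tension only bites at large batch sizes, where the marginal compute term starts to dominate.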
Speculative decoding uses a small, fast 'draft' model to generate N tokens cheaply, then the large target model verifies all N in a single parallel forward pass. Drafted tokens the target model accepts come essentially free; at the first mismatch, the target model's own prediction replaces the wrong draft. Each verification step therefore yields between 1 and N+1 tokens for the price of ~1 large-model step.
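A minimal sketch of the greedy variant, with hypothetical deterministic stand-ins for both models (real implementations verify via rejection sampling over the two models' distributions, which this simplifies away):

```python
def target_model(tokens):
    # Hypothetical large, expensive model: prefix -> next token.
    return (sum(tokens) * 3 + 1) % 17

def draft_model(tokens):
    # Hypothetical small, cheap model that usually agrees with the target.
    t = target_model(tokens)
    return t if t % 5 else (t + 1) % 17  # occasionally wrong

def speculative_step(tokens, n_draft=4):
    """Draft n_draft tokens serially with the cheap model, then verify.
    Accept the longest matching prefix, plus one token from the target
    model (the correction on a mismatch, or a bonus token otherwise)."""
    drafted, ctx = [], list(tokens)
    for _ in range(n_draft):
        nxt = draft_model(ctx)
        drafted.append(nxt)
        ctx.append(nxt)
    # Verification: in a real system these target-model predictions all
    # come from ONE parallel forward pass; here we just loop.
    accepted = list(tokens)
    for d in drafted:
        t = target_model(accepted)
        accepted.append(t)
        if t != d:        # first mismatch: keep the correction, stop
            break
    else:
        accepted.append(target_model(accepted))  # all accepted: bonus token
    return accepted
```

The output is identical to what greedy decoding with the target model alone would produce; the draft model only changes how many large-model steps that output costs.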
These constraints — serial decoding, memory bandwidth bottleneck, latency-throughput tension — motivate entirely new model architectures that generate all tokens in parallel rather than one at a time. That's where diffusion language models come in.
Each limitation described here is a concrete engineering pressure: hardware designed for parallel dense matrix multiplications sits underutilised during autoregressive generation. The question isn't whether alternatives will emerge, but which tradeoffs they make.