Lesson 03

Autoregressive Generation

A language model factors the probability of a sequence into a chain of next-token predictions. At inference time this becomes a loop: predict, sample, append, repeat.
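The predict–sample–append loop can be sketched with a toy stand-in for a trained model. The vocabulary and probabilities below are made up for illustration, and the "model" conditions only on the last token, where a real LLM conditions on the full context:

```python
# Toy stand-in for a trained model: maps the last token to a
# next-token distribution. (Hypothetical values, for illustration only.)
TOY_MODEL = {
    "The": {"quick": 0.7, "slow": 0.3},
    "quick": {"brown": 0.9, "red": 0.1},
    "brown": {"fox": 0.8, "dog": 0.2},
    "fox": {"<eos>": 1.0},
}

def generate(prompt, max_new_tokens=10):
    """Predict, sample, append, repeat (greedy decoding)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = TOY_MODEL[tokens[-1]]          # predict
        next_token = max(dist, key=dist.get)  # greedy pick
        if next_token == "<eos>":
            break
        tokens.append(next_token)             # append, then repeat
    return tokens

print(generate(["The"]))  # ['The', 'quick', 'brown', 'fox']
```

Greedy decoding always takes the highest-probability token; sampling strategies that introduce randomness are covered under "Temperature and sampling" below.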

Language modeling

A language model defines a probability distribution over sequences of tokens. The training objective is to minimise the cross-entropy between the model's predicted distribution and the actual next token at each position of the training data.

// Joint probability as product of conditionals
P(x₁, x₂, ..., xₙ) = ∏_{t=1}^{n} P(xₜ | x₁, ..., x_{t-1})

// Training loss (cross-entropy)
L = -1/N · Σ_{t=1}^{N} log P(xₜ | x₁..x_{t-1})
Cross-entropy — what it measures: for each position in the training text, the model outputs a probability for every token in the vocabulary. Cross-entropy is −log P(correct token). If the model gives the right next token probability 0.9, loss = −log(0.9) ≈ 0.1. If it gives 0.01, loss = −log(0.01) ≈ 4.6. The log makes confident wrong answers very expensive. Summed over all positions and averaged, this is the number the optimiser minimises.
Teacher forcing: during training, the model sees ground-truth previous tokens, not its own predictions. At inference it must use its own outputs — a gap called exposure bias.
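The per-position numbers above can be checked directly. This is a minimal sketch of the per-token loss and its average, using the same example probabilities (0.9 and 0.01) as the text:

```python
import math

def cross_entropy(probs, target):
    """Loss for one position: -log P(correct token)."""
    return -math.log(probs[target])

# Confident and correct: cheap.
print(round(cross_entropy({"fox": 0.9, "dog": 0.1}, "fox"), 2))   # 0.11
# Confident but wrong about "fox": the log makes this expensive.
print(round(cross_entropy({"fox": 0.01, "dog": 0.99}, "fox"), 2)) # 4.61

# Averaged over positions, this is the L defined above. Under teacher
# forcing, each position's distribution is conditioned on ground truth.
losses = [cross_entropy({"fox": 0.9}, "fox"),
          cross_entropy({"fox": 0.01}, "fox")]
print(round(sum(losses) / len(losses), 2))  # 2.36
```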

Tokenisation

Text must be converted to integer token IDs before the model can process it. Modern LLMs use Byte-Pair Encoding (BPE) or SentencePiece — subword tokenisers that balance vocabulary size against token count.
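The core of BPE training is simple: repeatedly find the most frequent adjacent pair of tokens and merge it into a new token. This is a minimal sketch of one merge step; real tokenisers operate on bytes, learn tens of thousands of merges, and handle many edge cases this toy version ignores:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("abcabcab")           # start from individual characters
pair = most_frequent_pair(tokens)   # ('a', 'b') occurs three times
tokens = merge_pair(tokens, pair)
print(tokens)                       # ['ab', 'c', 'ab', 'c', 'ab']
```

Each learned merge grows the vocabulary by one token while shortening the token sequence, which is exactly the vocabulary-size-versus-token-count trade-off mentioned above.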

Temperature and sampling

At each step the model outputs a logit vector over the entire vocabulary. Sampling strategy controls how we pick the next token from this distribution.

[Interactive figure: next-token probability distribution for the context "The quick brown fox"]
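Temperature rescales the logits before the softmax: values below 1 sharpen the distribution toward the top token, values above 1 flatten it. A minimal sketch (the logits are made-up values for three candidate tokens):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature controls sharpness."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]  # hypothetical logits for three tokens

print([round(p, 3) for p in softmax(logits, temperature=1.0)])
print([round(p, 3) for p in softmax(logits, temperature=0.5)])  # sharper
print([round(p, 3) for p in softmax(logits, temperature=2.0)])  # flatter
```

Once converted to probabilities, the next token can be drawn with e.g. `random.choices(candidates, weights=probs)`; temperature 0 is conventionally treated as greedy decoding.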

KV Cache

Every forward pass computes Keys and Values for each token in the context. Without caching, generating token N requires re-computing K,V for all N−1 previous tokens. The KV cache stores these computed values and reuses them.

[Figure: attention operations per step]
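The caching pattern itself is just an append-only store per layer. This sketch keeps the shape of the idea only; the projections are stand-ins (a real model has per-layer, per-head K/V tensors produced by learned weight matrices):

```python
class KVCache:
    """Append-only store of per-token Keys and Values."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        """Store K,V for the newly processed token."""
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

cache = KVCache()
for token_embedding in [[1.0], [2.0], [3.0]]:
    # Without a cache we would recompute K,V for every previous token
    # here; with it, each step projects only the newest token.
    k = token_embedding  # stand-in for W_K @ embedding
    v = token_embedding  # stand-in for W_V @ embedding
    cache.append(k, v)
    # Attention at this step reads cache.keys / cache.values.

print(len(cache))  # 3
```

The trade-off is memory: the cache grows linearly with context length and, in practice, dominates inference memory for long contexts.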

Context length and cost

Self-attention computes pairwise interactions between all tokens. For a sequence of length n, this requires O(n²) operations and O(n²) memory. This is why context windows have historically been limited, and why extending them is expensive.
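The quadratic growth is easy to see by counting the entries of the score matrix: every query interacts with every key, so doubling the sequence length quadruples the work. A toy sketch with scalar "vectors":

```python
def attention_scores(queries, keys):
    """Pairwise query-key scores: an n x n matrix of entries."""
    return [[q * k for k in keys] for q in queries]

for n in (128, 256, 512):
    x = [1.0] * n
    scores = attention_scores(x, x)
    entries = sum(len(row) for row in scores)
    print(n, entries)  # n, n^2: each doubling of n quadruples the count
```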

[Figure: compute cost vs context length]
Flash Attention rewrites the attention algorithm so it never materialises the full n×n attention matrix, reducing memory from O(n²) to O(n) at the cost of some recomputation. Together with other long-context techniques, this is what makes million-token contexts feasible.
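The key trick that avoids materialising the matrix is the online softmax: process the keys in chunks, keeping only a running maximum, normaliser, and weighted accumulator, rescaling them whenever a new maximum appears. This is a single-query, scalar sketch of that idea, not Flash Attention itself (which additionally tiles queries and exploits GPU memory hierarchy):

```python
import math

def attention_row_full(q, keys, values):
    """Standard attention for one query: materialises all n scores."""
    scores = [q * k for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

def attention_row_streamed(q, keys, values, chunk=4):
    """Same result, visiting keys chunk by chunk with a running
    max / normaliser / accumulator (the online-softmax idea)."""
    m, z, acc = float("-inf"), 0.0, 0.0
    for i in range(0, len(keys), chunk):
        ks, vs = keys[i:i + chunk], values[i:i + chunk]
        scores = [q * k for k in ks]
        new_m = max(m, max(scores))
        scale = math.exp(m - new_m)   # rescale earlier partial sums
        z, acc = z * scale, acc * scale
        for s, v in zip(scores, vs):
            w = math.exp(s - new_m)
            z += w
            acc += w * v
        m = new_m
    return acc / z

keys = [0.1 * i for i in range(16)]
values = [float(i) for i in range(16)]
print(abs(attention_row_full(2.0, keys, values)
          - attention_row_streamed(2.0, keys, values)) < 1e-9)  # True
```

Only the current chunk of scores ever exists in memory, so the footprint is O(chunk) per query instead of O(n), at the cost of rescaling the partial sums.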