A language model defines a probability distribution over sequences of tokens, factored into a chain of next-token predictions. At inference time this becomes a loop: predict, sample, append, repeat.
The training objective is to minimise the cross-entropy between the model's predicted distribution and the actual next token in the training data.
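The predict-sample-append loop can be sketched with a stand-in for the real model. Everything here is illustrative: `fake_model` is a hypothetical hand-written bigram table, not any actual LM API.

```python
import random

def fake_model(context):
    """Stand-in for a real model: returns a next-token distribution
    given the context. Hypothetical bigram table, just to make the
    loop runnable."""
    table = {
        "<s>": {"the": 0.7, "a": 0.3},
        "the": {"cat": 0.5, "dog": 0.5},
        "a":   {"cat": 0.5, "dog": 0.5},
        "cat": {"</s>": 1.0},
        "dog": {"</s>": 1.0},
    }
    return table[context[-1]]

# The inference loop: predict, sample, append, repeat until end-of-sequence.
tokens = ["<s>"]
while tokens[-1] != "</s>":
    dist = fake_model(tokens)
    nxt = random.choices(list(dist), weights=list(dist.values()))[0]
    tokens.append(nxt)
print(tokens)
```

A real model replaces the lookup table with a neural network conditioned on the whole context, but the loop structure is the same.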
The per-position loss is −log P(correct token). If the model assigns the correct next token probability 0.9, the loss is −log(0.9) ≈ 0.1; if it assigns 0.01, the loss is −log(0.01) ≈ 4.6. The log makes confident wrong answers very expensive. Summed over all positions and averaged, this is the number the optimiser minimises.
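The arithmetic above is easy to check directly; the probabilities here are the same illustrative values as in the text:

```python
import math

def token_loss(p_correct: float) -> float:
    """Cross-entropy contributed by one position: -log P(correct token)."""
    return -math.log(p_correct)

print(round(token_loss(0.9), 2))   # a confident correct prediction is cheap
print(round(token_loss(0.01), 2))  # a confident miss is very expensive

# Sequence loss: average the per-position losses (hypothetical probabilities).
probs = [0.9, 0.5, 0.01]
avg = sum(token_loss(p) for p in probs) / len(probs)
```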
Text must be converted to integer token IDs before the model can process it. Modern LLMs use Byte-Pair Encoding (BPE) or SentencePiece — subword tokenisers that balance vocabulary size against token count.
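A toy sketch of the core BPE idea, assuming the standard greedy scheme (repeatedly merge the most frequent adjacent pair); real tokenisers like those behind BPE or SentencePiece add byte-level fallbacks, learned merge tables, and special tokens:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Start from characters and apply a few merges: frequent substrings
# ("lo", then "low") become single tokens, shrinking the token count.
tokens = list("low lower lowest")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

Each merge grows the vocabulary by one entry while shortening typical sequences, which is exactly the size-versus-length trade-off mentioned above.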
At each step the model outputs a logit vector over the entire vocabulary; a softmax converts it to a probability distribution, and the sampling strategy (greedy, temperature, top-k, nucleus) controls how we pick the next token from it.
Context: "The quick brown fox" — the model assigns high probability to plausible continuations like " jumps" and near-zero probability to almost everything else.
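A minimal sketch of temperature and top-k sampling over that context. The four-word vocabulary and the logit values are invented for illustration:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens them."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_k(logits, k, temperature=1.0):
    """Keep the k highest-logit tokens, renormalise, sample one index."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top], temperature)
    return random.choices(top, weights=probs)[0]

# Hypothetical logits for continuations of "The quick brown fox".
vocab = [" jumps", " runs", " sleeps", " quantum"]
logits = [3.2, 2.1, 0.5, -4.0]
idx = sample_top_k(logits, k=2)
print(vocab[idx])  # " jumps" or " runs"; the rest are filtered out
```

Greedy decoding is the k=1 special case; temperature → 0 approaches it from the other direction.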
Every forward pass computes Keys and Values for each token in the context. Without caching, generating token N requires re-computing K,V for all N−1 previous tokens. The KV cache stores these computed values and reuses them.
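A toy single-head version of the idea, with made-up projection matrices `Wq`, `Wk`, `Wv`: each new token computes its own K and V once, appends them to the cache, and attends over everything cached so far, which matches recomputing K,V from scratch:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # head dimension (toy size)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Single-head attention of one query against keys/values."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Incremental decoding: K,V for past tokens are computed once and cached.
xs = rng.normal(size=(5, d))             # hidden states for 5 tokens
cache_K, cache_V, outputs = [], [], []
for x in xs:
    cache_K.append(Wk @ x)               # K,V for the NEW token only
    cache_V.append(Wv @ x)
    q = Wq @ x                           # the query is not cached
    outputs.append(attend(q, np.array(cache_K), np.array(cache_V)))
```

Only queries need recomputing each step; the cache trades memory (it grows linearly with context length) for avoided recomputation.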
Self-attention computes pairwise interactions between all tokens. For a sequence of length n, this requires O(n²) operations and O(n²) memory. This is why context windows have historically been limited, and why extending them is expensive.
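The quadratic blow-up is easy to make concrete: the score matrix has one entry per token pair, so its size alone (before any other activation memory, and per head, per layer) grows with n². The context lengths below are illustrative:

```python
import numpy as np

n, d = 16, 8
X = np.random.default_rng(0).normal(size=(n, d))
scores = X @ X.T / np.sqrt(d)            # pairwise token interactions
assert scores.shape == (n, n)            # O(n^2) entries

# Memory for one fp32 score matrix at various context lengths:
for ctx in (2_048, 32_768, 128_000):
    print(f"n={ctx:>7}: {ctx * ctx * 4 / 1e9:8.2f} GB")
```

Going from a 2K to a 128K context multiplies this term by roughly 4000×, which is why long-context attention relies on tricks (fused kernels that never materialise the full matrix, sparse or windowed attention) rather than the naive computation.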