Lesson 01
Attention is the transformer's key innovation: instead of compressing a sequence into a fixed vector, it lets every token directly query every other, weighted by relevance.
Before attention, sequence models (RNNs, LSTMs) compressed the entire past into a fixed-size hidden state — a bottleneck for long sequences. Attention removes this bottleneck: every token can attend directly to any other token, regardless of distance.
Each token projects itself into three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I contribute?). The dot product QKᵀ measures how well each query matches each key. After scaling and softmax, these weights blend the values.
Click a token to see its attention distribution
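The Q/K/V computation above can be sketched in a few lines of NumPy. This is a minimal single-head version: the token count, embedding dimension, and random weight matrices are illustrative placeholders, not values from the lesson.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project each token into Query, Key, Value
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # QKᵀ, scaled by √dk
    weights = softmax(scores, axis=-1)         # each row is one token's attention distribution
    return weights @ V                         # blend the values by relevance

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                   # 6 tokens, 16-dim embeddings (illustrative)
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)                 # same shape as X: one blended vector per token
```

Each row of `weights` sums to 1, which is exactly the per-token distribution the widget above visualises.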
Temperature is a number you divide the logits by before softmax. A higher temperature spreads probability more evenly across all options; a lower temperature concentrates it on the top candidate.
The √dk divisor in the attention formula is a fixed temperature correction. Dot products grow with the embedding dimension dk: for random vectors, their typical magnitude scales as √dk, so dimension 64 yields dot products roughly 8× larger than dimension 1. Without scaling, a large dk acts like a very low temperature: the model focuses almost entirely on one token, gradients vanish, and training stalls.
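The √dk growth is easy to verify empirically. A quick sketch (sample sizes are arbitrary): draw many random query/key pairs at two dimensions and compare the spread of their dot products.

```python
import numpy as np

rng = np.random.default_rng(1)
spreads = {}
for dk in (1, 64):
    q = rng.normal(size=(10_000, dk))          # 10,000 random unit-variance queries
    k = rng.normal(size=(10_000, dk))          # ...and keys
    dots = (q * k).sum(axis=1)                 # one dot product per pair
    spreads[dk] = dots.std()                   # typical magnitude grows like √dk

# spreads[64] / spreads[1] comes out near √64 = 8
```

Dividing the scores by √dk undoes exactly this growth, keeping the softmax at a constant effective temperature regardless of dk.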
Drag the slider to feel the effect. The bars show the probability each word receives as the next token after "The fox quietly ___".
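The slider's effect can also be reproduced directly. A minimal sketch, with hypothetical logits standing in for the model's real scores over candidate next words:

```python
import numpy as np

def softmax_T(logits, T):
    z = np.asarray(logits, dtype=float) / T    # temperature divides the logits
    e = np.exp(z - z.max())                    # subtract max for numerical stability
    return e / e.sum()

logits = [2.0, 1.0, 0.5, 0.1]                  # hypothetical scores for four candidate words

low = softmax_T(logits, 0.1)    # sharp: nearly all probability on the top candidate
high = softmax_T(logits, 10.0)  # flat: probability spread almost evenly
```

At T = 0.1 the top candidate takes essentially all the mass; at T = 10 the four bars become nearly equal, just as the slider shows.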
Instead of one attention function, transformers run H heads in parallel — each with its own learned Q/K/V projection matrices. Training pushes each head toward a different specialisation. The three patterns below represent real types of attention that emerge in trained models.
Each matrix is an attention map for "The cat sat on the mat". Rows = from (query token), columns = to (key token). Brighter cell = higher attention weight.
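Running H heads in parallel is just the single-head computation repeated with separate projection matrices and the results concatenated. A minimal sketch (dimensions and random weights are illustrative; the output projection that real transformers apply after concatenation is omitted for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads):
    outs = []
    for Wq, Wk, Wv in heads:                       # each head: its own learned projections
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # one attention map per head
        outs.append(A @ V)
    return np.concatenate(outs, axis=-1)           # heads concatenated along the feature axis

rng = np.random.default_rng(2)
d_model, H = 16, 4
d_head = d_model // H                              # each head works in a narrower subspace
X = rng.normal(size=(6, d_model))                  # 6 tokens, e.g. "The cat sat on the mat"
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(H)]
Y = multi_head_attention(X, heads)
```

Each head's matrix `A` here is one of the maps shown above; with training, different heads settle into different patterns.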
Self-attention is permutation-equivariant: shuffling the input tokens simply shuffles the outputs, so it has no inherent sense of token order. Positional encodings inject position information by adding a fixed vector to each token embedding before the first attention layer. The sinusoidal encoding uses different frequencies for different dimensions, creating a unique fingerprint for each position.
Positional encoding heatmap — 24 positions × 16 dimensions
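The grid in the heatmap can be generated with the standard sinusoidal formula: even dimensions get a sine, odd dimensions a cosine, with the frequency falling geometrically across dimension pairs. A sketch sized to match the 24 × 16 figure:

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(n_positions)[:, None]          # column of positions
    i = np.arange(0, d_model, 2)[None, :]          # one frequency per dimension pair
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(24, 16)                   # the 24 positions × 16 dimensions shown above
```

Because every row mixes all frequencies, no two positions share the same vector: that is the "unique fingerprint" the heatmap makes visible.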