
Lesson 01

Attention Mechanism

The key innovation of the transformer. Instead of compressing a sequence into a fixed vector, attention lets every token directly query every other — weighted by relevance.

The Sequence Problem

Before attention, sequence models (RNNs, LSTMs) compressed the entire past into a fixed-size hidden state — a bottleneck for long sequences. Attention removes this bottleneck: every token can attend directly to any other token, regardless of distance.

Coreference example: In a sentence like "The animal didn't cross the street because it was too tired," resolving what "it" refers to requires connecting tokens that may be far apart. Attention handles this naturally by computing pairwise relevance scores between all token positions simultaneously.

Queries, Keys, and Values

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V

Q = queries, K = keys, V = values

Each token projects itself into three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I contribute?). The dot product QKᵀ measures how well each query matches each key. After scaling and softmax, these weights blend the values.
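The formula above can be sketched in a few lines of NumPy (an illustrative implementation, not any particular library's API; the helper names are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) relevance scores
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1
    return weights @ V, weights

# Tiny example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, w = attention(Q, K, V)
print(out.shape)  # (3, 4): one blended value vector per query token
```

Each row of `w` is one token's attention distribution over all the keys; the output is the value vectors blended by those weights.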


Scaled Dot-Product — Softmax Temperature

Temperature is a number you divide the logits by before softmax. A higher temperature spreads probability more evenly across all options; a lower temperature concentrates it on the top candidate.

Low temperature (→ 0): Softmax becomes winner-takes-all. The highest score gets nearly all the weight — attention focuses sharply on one token. Deterministic and confident, but rigid.
High temperature (→ ∞): All scores get compressed toward zero before softmax, so every token gets nearly equal weight — attention becomes diffuse and uniform. The model stops discriminating.
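Both extremes are easy to verify numerically. A minimal sketch (the function name is ours):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    # Divide the logits by the temperature T before the softmax.
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

scores = [2.0, 1.0, 0.5]
for T in (0.1, 1.0, 10.0):
    print(T, softmax_with_temperature(scores, T).round(3))
```

At T = 0.1 nearly all the probability lands on the top score (winner-takes-all); at T = 10 the distribution is close to uniform.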

The √d_k divisor in the attention formula is a fixed temperature correction. For random vectors with unit-variance components, the variance of a dot product grows linearly with the dimension d_k, so its typical magnitude grows like √d_k — at d_k = 64, dot products are roughly 8× larger than at d_k = 1. Without scaling, large d_k acts like a very low temperature: the model focuses almost entirely on one token, gradients vanish, and training stalls.

Rule of thumb: dividing by √d_k keeps dot-product magnitudes near 1 regardless of model size, so the softmax operates in its sensitive region, where small score differences still produce meaningful gradients.
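The √d_k growth is easy to observe empirically. This sketch (assumptions: random unit-variance vectors, 10,000 samples) measures the spread of raw and scaled dot products:

```python
import numpy as np

rng = np.random.default_rng(0)

# The std-dev of q·k for random unit-variance vectors grows like sqrt(d_k);
# dividing by sqrt(d_k) holds it near 1 at every dimension.
for d_k in (1, 16, 64, 256):
    q = rng.standard_normal((10000, d_k))
    k = rng.standard_normal((10000, d_k))
    dots = (q * k).sum(axis=1)
    print(d_k, dots.std().round(2), (dots / np.sqrt(d_k)).std().round(2))
```

The middle column grows roughly as √d_k (about 1, 4, 8, 16); the scaled column stays near 1 throughout.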

Interactive demo: drag the slider to feel the effect. The bars show the probability each word receives as the next token after "The fox quietly ___" at each temperature setting.

Multi-Head Attention

Instead of one attention function, transformers run H heads in parallel — each with its own learned Q/K/V projection matrices. Training pushes each head toward a different specialisation. The three patterns below represent real types of attention that emerge in trained models.

MultiHead(Q, K, V) = Concat(head_1, …, head_H) W^O
  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
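The concat-and-project structure can be sketched in NumPy. This is a simplified single-matrix variant (one d_model × d_model projection sliced into per-head chunks, a common implementation trick, not the only one; all names are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Run n_heads attention functions in parallel, concatenate, project with Wo."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)   # this head's slice of each projection
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo    # Concat(head_1..H) W^O

rng = np.random.default_rng(0)
n, d_model, H = 6, 16, 4   # 6 tokens, model width 16, 4 heads
X = rng.standard_normal((n, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, H).shape)  # (6, 16)
```

Each head attends over its own d_model/H-dimensional slice, so the total compute is comparable to one full-width attention while letting the heads specialise independently.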

Each matrix is an attention map for "The cat sat on the mat". Rows = from (query token), columns = to (key token). Brighter cell = higher attention weight.

Head 1 — Syntactic: tracks grammatical dependencies. "The" → "cat" (determiner→noun), "cat" → "sat" (subject→verb), "on" → "mat" (preposition→object). Pattern is sparse and asymmetric — it follows the parse tree.
Head 2 — Coreference: links identical or semantically paired words. "The" (pos 0) and "the" (pos 4) attend to each other; so do "cat" and "mat". The pattern is symmetric — bright cells mirror across the diagonal.
Head 3 — Positional: attends to nearby tokens regardless of content. Attention weight falls off smoothly with distance. The pattern is a bright diagonal band — a local context window.

Positional Encoding

Self-attention is permutation-invariant — it has no inherent sense of token order. Positional encodings inject position information by adding a fixed vector to each token embedding before the first attention layer. The sinusoidal encoding uses different frequencies for different dimensions, creating a unique fingerprint for each position.

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
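These two formulas translate directly to NumPy; here is a sketch sized to match the 24 × 16 heatmap below (the function name is ours):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]         # index of each sin/cos pair
    angle = pos / 10000 ** (2 * i / d_model)     # (n_positions, d_model // 2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dims: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                  # odd dims:  PE(pos, 2i+1)
    return pe

pe = positional_encoding(24, 16)
print(pe.shape)  # (24, 16)
```

Low-index dimensions oscillate fast and high-index ones slowly, so each row (position) gets a unique fingerprint that is simply added to the token embedding.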

Positional encoding heatmap — 24 positions × 16 dimensions