Lesson 00 — Precursors

Before Attention

Every architecture choice in the transformer was a response to a concrete failure mode in what came before. Understanding the bottlenecks of RNNs and LSTMs makes the design of attention feel inevitable.

A brief history of sequence modelling

Language modelling predates neural networks. Each era solved the previous era's failure mode — and introduced a new one.

1948 – 1990s
N-gram models
Count co-occurrences. Predict next word from the last N−1 words. No learned parameters, just counts.
2001 – 2013
Feed-forward LMs
Bengio et al. — embed words, concatenate a fixed window, predict with a neural net. Fixed context window.
2013 – 2016
RNN / LSTM
Recurrent hidden state lets the model process unbounded sequences. But the state is a fixed-size bottleneck.
2014 – 2017
Seq2Seq
Encoder compresses a full sentence to one vector; decoder generates from it. Bottleneck is painfully visible on long inputs.
2017 →
Attention / Transformer
Every output token directly attends to every input token. The bottleneck is gone.

N-gram language models

The simplest language model: count how often each word follows a given context in a training corpus. Given the last N−1 words, the next word is predicted by looking up the most common continuation.

P(wt | w1…wt-1) ≈ P(wt | wt-N+1…wt-1)   ← Markov assumption: only last N−1 words matter

Click a word below to see its bigram continuation probabilities from the training corpus "the cat sat on the mat the cat lay on the rug a cat sat on a mat".

Bigram probabilities — P(next word | selected word)
Failure mode: no long-range memory. N-gram models are blind to anything outside their window. "The trophy didn't fit in the suitcase because it was too big" — resolving "it" requires connecting tokens that are 7 positions apart. A trigram model simply cannot do this.
Data sparsity. A trigram model on a 50,000-word vocabulary has 50,000³ = 125 trillion possible entries. Most are never observed. N-gram models need elaborate smoothing tricks just to assign non-zero probability to unseen combinations.
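The counting scheme described above can be sketched in a few lines of Python. The corpus is the same toy sentence as in the demo; `bigram_probs` is a hypothetical helper name, not part of any library:

```python
from collections import Counter, defaultdict

# Toy corpus from the interactive demo above.
corpus = "the cat sat on the mat the cat lay on the rug a cat sat on a mat".split()

# Count how often each word follows each other word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def bigram_probs(word):
    """P(next | word) estimated by relative frequency of observed continuations."""
    c = counts[word]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

print(bigram_probs("the"))   # → {'cat': 0.5, 'mat': 0.25, 'rug': 0.25}
```

Any pair never seen in the corpus gets probability zero, which is exactly the sparsity problem smoothing has to patch over.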

Recurrent neural networks

An RNN processes a sequence one token at a time. At each step it combines the new input with the previous hidden state to produce a new hidden state. The hidden state acts as a running "memory" of everything seen so far.

ht = tanh( Wh·ht-1 + Wx·xt + b )   ← new hidden state from old state + current input
yt = softmax( Wy·ht )   ← prediction from hidden state
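The two equations above translate directly into NumPy. The dimensions and weight scales below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, vocab = 8, 4, 10   # toy sizes: hidden dim, input dim, vocabulary

Wh = rng.normal(scale=0.3, size=(d_h, d_h))   # state-to-state weights
Wx = rng.normal(scale=0.3, size=(d_h, d_x))   # input-to-state weights
b  = np.zeros(d_h)
Wy = rng.normal(scale=0.3, size=(vocab, d_h)) # state-to-logits weights

def rnn_step(h_prev, x_t):
    """One recurrence: new hidden state from old state + current input."""
    h_t = np.tanh(Wh @ h_prev + Wx @ x_t + b)
    logits = Wy @ h_t
    y_t = np.exp(logits) / np.exp(logits).sum()   # softmax over the vocabulary
    return h_t, y_t

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):   # five arbitrary input vectors
    h, y = rnn_step(h, x)
```

Note that `h` has the same shape after every step: however long the sequence, everything seen so far must fit into those `d_h` numbers.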

Step through the sentence below. Watch the hidden state (coloured bar) — each segment represents how much of the state is devoted to each token. Earlier tokens fade as new ones arrive.

RNN processing — coloured bar shows each token's share of the hidden state
Press "Process next token" to start.
The vanishing gradient problem. Training an RNN requires backpropagating error through every time step. At each step the gradient is multiplied by Wh (and scaled down further by the tanh derivative): if the largest singular value of Wh is below 1, gradients shrink towards zero; if above 1, they explode. In practice, gradients vanish after ~10–20 steps, so the network cannot learn long-range dependencies.
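A minimal numerical sketch of the effect, assuming a weight matrix rescaled so its largest singular value is 0.9 (the tanh derivative, which only shrinks gradients further, is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wh = rng.normal(size=(d, d))
Wh *= 0.9 / np.linalg.norm(Wh, 2)   # rescale: largest singular value = 0.9

# Backprop through T steps multiplies the gradient by Wh (transposed) once
# per step. Track how the gradient norm decays with depth.
grad = np.ones(d)
norms = [np.linalg.norm(grad)]
for _ in range(20):
    grad = Wh.T @ grad
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[10], norms[20])   # monotone shrinkage toward zero
```

Swap the 0.9 for 1.1 and the same loop explodes instead, which is why plain RNNs sit on a knife edge between vanishing and exploding gradients.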

Long Short-Term Memory (LSTM)

Hochreiter & Schmidhuber (1997) solved the vanishing gradient problem with gating. An LSTM maintains two states: a hidden state ht (short-term) and a cell state Ct (long-term). Three learned gates control what to forget, what to write, and what to read.

ft = σ(Wf·[ht-1, xt] + bf)  ← forget gate: what to erase from cell state
it = σ(Wi·[ht-1, xt] + bi)  ← input gate: what new info to write
Ct = ft ⊙ Ct-1 + it ⊙ tanh(Wc·[ht-1, xt] + bc)  ← update cell state
ot = σ(Wo·[ht-1, xt] + bo)  ← output gate: what to expose as ht
ht = ot ⊙ tanh(Ct)  ← new hidden state: gated read of the cell state

The cell state Ct flows through time with only element-wise gating and an additive update — the "constant error carousel" that allows gradients to flow without vanishing. Click each gate to see what it does.
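The gate equations can be sketched directly in NumPy. Sizes and weight scales are toy values, and the final line is the standard hidden-state read ht = ot ⊙ tanh(Ct):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_h, d_x = 8, 4          # toy sizes: hidden/cell dim, input dim
d_in = d_h + d_x         # every gate sees [h_prev, x_t] concatenated

Wf, Wi, Wc, Wo = (rng.normal(scale=0.3, size=(d_h, d_in)) for _ in range(4))
bf = bi = bc = bo = np.zeros(d_h)

def lstm_step(h_prev, C_prev, x_t):
    """One LSTM step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z + bf)                    # forget gate: what to erase
    i = sigmoid(Wi @ z + bi)                    # input gate: what to write
    C = f * C_prev + i * np.tanh(Wc @ z + bc)   # additive cell-state update
    o = sigmoid(Wo @ z + bo)                    # output gate: what to expose
    h = o * np.tanh(C)                          # hidden state read from cell
    return h, C

h, C = lstm_step(np.zeros(d_h), np.zeros(d_h), np.ones(d_x))
```

The crucial line is the cell update: `f * C_prev + i * ...` is additive in `C_prev`, so a gate near 1 passes the old state (and its gradient) through untouched.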

LSTM cell — click a gate to highlight its role
Forget gate — decides what fraction of the old cell state to keep. Near 0 = erase, near 1 = preserve.
LSTMs work well and were state of the art for translation, speech, and language modelling from ~2014–2017. They can carry information for hundreds of steps. But the fundamental architecture still compresses the past into a fixed-size vector — just more carefully.
Still a bottleneck. No matter how carefully the gates learn to preserve information, the hidden state has a fixed dimensionality (typically 256–2048). A sentence of 100 tokens must be compressed into that fixed vector. For translation, the decoder must reconstruct the full meaning from this single compressed representation.

Seq2Seq and the fixed-vector bottleneck

Sutskever et al. (2014) introduced Sequence-to-Sequence: an LSTM encoder reads the input sentence and produces a fixed-size context vector; an LSTM decoder generates the output from that vector alone. For short sentences it works well. For long sentences the bottleneck breaks it.

Encoder compresses the full input to one vector — decoder reads only that
Bahdanau et al. (2015) — "Neural Machine Translation by Jointly Learning to Align and Translate" — solved this by letting the decoder attend to all encoder hidden states, not just the last one. This paper introduced attention. The transformer (Vaswani et al. 2017) then removed the RNN entirely, using only attention.

The bottleneck visualised

Here is the core problem in one picture. An RNN must compress an arbitrarily long input into a fixed-size vector before any output can be generated. Information from early tokens is progressively overwritten.

Information capacity — drag the slider to see how signal from the first token decays

Attention removes this problem entirely: the output at every position can directly read from every input position. There is no compression step. The "context vector" becomes a full matrix of all hidden states.
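A minimal sketch of this idea in NumPy: scaled dot-product attention over a matrix of hidden states (toy sizes; self-attention, so queries, keys, and values all come from the same matrix):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every query reads from every key/value."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all positions
    return weights @ V                               # weighted sum of all values

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))    # six hidden states of dimension 8, one per token
out = attention(H, H, H)       # each output row reads from all six inputs
```

Nothing here is squeezed through a single vector: `out` has one row per position, and every row is a weighted combination of every input row.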

Summary of failure modes:
N-grams: no memory beyond N words  ·  Feed-forward LMs: fixed context window  ·  RNNs: vanishing gradients, fixed-size bottleneck  ·  LSTMs: same bottleneck, slower training  ·  Seq2Seq: bottleneck exposed at sentence boundary

Attention's answer: don't compress — let every position talk to every other position directly.