
Lesson 02

The Transformer

Stack attention and feed-forward layers with residual connections and layer normalisation. This block is the basic unit of every modern language model.

The Transformer Block

A single transformer block takes a sequence of token embeddings and outputs a transformed sequence of the same shape. It has two sub-layers — multi-head attention and a feed-forward network — each wrapped with a residual connection and layer normalisation.

The residual connections (also called skip connections) allow gradients to flow directly through the network during training, making it possible to stack many layers without vanishing gradients. Layer norm stabilises the activations after each sub-layer.
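As a rough sketch, the two sub-layers can be wired up in a few lines of NumPy. This is a minimal single-head, unmasked illustration in the post-LN style described above (each sub-layer computed as LayerNorm(x + Sublayer(x))); the weight names (`Wq`, `W1`, etc.) are our own, not from any library:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Single-head self-attention (no causal mask here, for brevity).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def transformer_block(x, params):
    # Each sub-layer is wrapped with a residual connection and layer norm.
    Wq, Wk, Wv, W1, W2 = params
    x = layer_norm(x + attention(x, Wq, Wk, Wv))   # sub-layer 1: attention
    ffn = np.maximum(0, x @ W1) @ W2               # sub-layer 2: feed-forward (ReLU)
    x = layer_norm(x + ffn)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))          # 5 tokens, d_model = 8
params = [rng.standard_normal(s) * 0.1 for s in
          [(8, 8), (8, 8), (8, 8), (8, 32), (32, 8)]]
y = transformer_block(x, params)         # output has the same shape as the input
```

Note that the output shape matches the input shape, which is what allows blocks to be stacked.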

The Causal Mask

GPT-style decoder transformers use causal masking: each token can only attend to itself and earlier tokens. This enables left-to-right generation — the model never "peeks" at future tokens during training, which would make the task trivially easy and useless for generation.

During training, the entire sequence is processed in parallel (unlike RNNs), but the mask ensures causality. During inference, tokens are generated one at a time, each conditioned on all previous outputs.
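One way to see how the mask works: set the attention scores for all future positions to negative infinity before the softmax, so they receive exactly zero weight. A small NumPy illustration with uniform raw scores:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

T = 4
scores = np.zeros((T, T))                          # uniform raw scores, for illustration
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal = future positions
scores[mask] = -np.inf                             # -inf becomes 0 after softmax
weights = softmax(scores)
# Row i attends uniformly over positions 0..i; every future weight is exactly 0.
# weights[0] == [1, 0, 0, 0]; weights[3] == [0.25, 0.25, 0.25, 0.25]
```

Because the mask is applied to the full score matrix at once, the whole sequence can still be processed in parallel during training.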


Depth — Stacking Blocks

A full transformer stacks N blocks sequentially. The output of block N becomes the input of block N+1 — the same sequence of vectors passes through every layer from Embedding up to the LM Head. Each block can only add to the representation (via its residual connection); it never starts from scratch.
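The stacking pattern is just a loop. In this toy sketch, `block` is a stand-in for a full transformer block (any shape-preserving function added back onto the residual stream); the point is that every layer reads and writes the same (tokens × d_model) tensor:

```python
import numpy as np

def block(x, W):
    # Stand-in for a transformer block: a shape-preserving update
    # added onto the residual stream, never replacing it.
    return x + np.tanh(x @ W)

rng = np.random.default_rng(0)
T, d, n_layers = 6, 16, 12
x = rng.standard_normal((T, d))                       # output of the embedding layer
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_layers)]

for W in Ws:          # block N's output is block N+1's input
    x = block(x, W)   # shape stays (T, d) at every depth
# x now feeds the LM head
```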

The depth creates a hierarchy of abstraction. Early blocks capture low-level patterns (word shapes, part-of-speech), middle blocks encode phrases and named entities, and late blocks encode higher-order semantics and world knowledge.

Reading the diagram: data flows bottom → top. Arrows show the output of each block feeding into the next. The three colour zones show rough depth specialisations observed in trained models.

Embedding: ~39M  ·  Per block: ~7M × 12 = ~85M  ·  Total: ~124M (GPT-2 small)
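The parameter count above can be verified by hand from GPT-2 small's hyperparameters (vocabulary 50,257, context 1,024, d_model 768, feed-forward width 3,072, 12 blocks). GPT-2 ties the output (LM head) weights to the token embedding, so the head adds no extra parameters:

```python
# GPT-2 small hyperparameters
vocab, n_ctx, d, d_ff, n_layers = 50257, 1024, 768, 3072, 12

embedding = vocab * d + n_ctx * d              # token + positional embeddings: ~39M

attn = 3 * (d * d + d) + (d * d + d)           # Q, K, V projections + output projection
mlp  = (d * d_ff + d_ff) + (d_ff * d + d)      # up- and down-projection
lns  = 2 * 2 * d                               # two layer norms (scale + bias each)
per_block = attn + mlp + lns                   # ~7.1M

total = embedding + n_layers * per_block + 2 * d   # + final layer norm: ~124M
print(f"embedding ≈ {embedding/1e6:.0f}M, "
      f"per block ≈ {per_block/1e6:.1f}M, total ≈ {total/1e6:.0f}M")
```

Running this reproduces the figures quoted above: ~39M embedding, ~7.1M per block, ~124M total.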

Encoder vs Decoder vs Encoder-Decoder

Transformers come in three flavours, each suited to different tasks. The attention masking strategy is the key difference.

Flavour          Mask                             Examples              Used for
Encoder          Bidirectional (full attention)   BERT, RoBERTa         Classification, embeddings, understanding
Decoder          Causal (left-to-right only)      GPT-4, Llama, Claude  Text generation, chat, completion
Encoder-Decoder  Encoder: full, decoder: causal   T5, BART              Translation, summarisation, seq2seq

Modern frontier models (GPT-4, Claude, Gemini, Llama) are all decoder-only transformers. The simplicity of the decoder architecture scales better with compute — no need to balance a separate encoder and decoder, and the causal objective naturally supports open-ended generation.