
Learning LLMs

From the attention mechanism through autoregressive transformers to diffusion language models — built from first principles with interactive simulations.

Before Attention

📜

Lesson 00

Before Attention

N-grams, RNNs, LSTMs, and the Seq2Seq bottleneck — the sequence models that attention was designed to replace.

N-gram · RNN · LSTM · Seq2Seq
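The Seq2Seq bottleneck the lesson refers to can be seen in a few lines: a vanilla RNN encoder folds an entire input sequence into one fixed-size hidden vector, no matter how long the sequence is. A minimal NumPy sketch (toy weights, untrained, for illustration only):

```python
import numpy as np

def rnn_encode(xs, Wx, Wh):
    """Run a vanilla RNN over a sequence and return the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)  # one recurrent step per token
    return h                           # the single fixed-size "context vector"

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
Wx = rng.standard_normal((d_h, d_in)) * 0.1
Wh = rng.standard_normal((d_h, d_h)) * 0.1

short = rng.standard_normal((3, d_in))    # 3-token sequence
long = rng.standard_normal((50, d_in))    # 50-token sequence
# Both sequences are compressed into the same 16-dimensional vector —
# that fixed-size summary is the bottleneck attention was built to remove.
h_short, h_long = rnn_encode(short, Wx, Wh), rnn_encode(long, Wx, Wh)
```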

Foundations

🔤

Reference

Tokenization

How text becomes integers. BPE merge steps, a live tokenizer demo, vocabulary tradeoffs, and special tokens.

BPE · Vocabulary · Subword
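The core of BPE is a simple loop: count adjacent symbol pairs, merge the most frequent pair into a new symbol, repeat. A minimal sketch of one merge step over a toy word-frequency corpus (illustrative only, not a production tokenizer):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word stored as a tuple of characters.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
pair = most_frequent_pair(corpus)   # ('l', 'o') occurs in every word
corpus = merge_pair(corpus, pair)   # "lo" is now a single vocabulary symbol
```

Repeating this loop a fixed number of times yields the merge table; vocabulary size is simply the number of merges you allow.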
🔍

Lesson 01

Attention Mechanism

Queries, keys and values. How dot-product attention lets every token communicate with every other, and why it works.

QKV · Softmax · Multi-Head
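The whole mechanism fits in a few lines: score every query against every key, softmax the scores into weights, and take a weighted sum of values. A single-head NumPy sketch with random toy matrices (no learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d)) V — every query attends over all keys."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (T, T): similarity of each query to each key
    if causal:
        # Causal mask: position i may only attend to positions <= i.
        scores = np.where(np.tril(np.ones_like(scores, dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V

T, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V, causal=True)
# Under the causal mask, position 0 can only attend to itself,
# so its output is exactly V[0].
```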
🏗️

Lesson 02

The Transformer

Layer norm, feed-forward layers, residual connections, and the causal mask. How transformer blocks stack into a full model.

Architecture · Residuals · Causal Mask
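How those pieces fit together can be sketched as one pre-norm block: a residual connection wraps causal self-attention, another wraps the feed-forward layer. A toy NumPy version with Q = K = V = x for brevity (no learned attention projections, untrained weights):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean, unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, W2):
    """Position-wise feed-forward: expand, ReLU, project back."""
    return np.maximum(x @ W1, 0) @ W2

def attention(x):
    """Causal self-attention with Q = K = V = x for brevity."""
    d = x.shape[-1]
    s = x @ x.T / np.sqrt(d)
    s = np.where(np.tril(np.ones_like(s, dtype=bool)), s, -np.inf)
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ x

def block(x, W1, W2):
    x = x + attention(layer_norm(x))     # residual around attention
    x = x + ffn(layer_norm(x), W1, W2)   # residual around feed-forward
    return x

rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.standard_normal((T, d))
W1 = rng.standard_normal((d, 4 * d)) * 0.1
W2 = rng.standard_normal((4 * d, d)) * 0.1
y = block(x, W1, W2)  # same shape in, same shape out — so blocks stack
```

Because each block maps (T, d) to (T, d), a full model is just this block applied N times, followed by a projection to vocabulary logits.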
➡️

Lesson 03

Autoregressive Generation

Next-token prediction, temperature sampling, KV cache, and context windows. How a trained transformer actually generates text.

Sampling · KV Cache · Context
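Temperature sampling, at least, is small enough to show inline: divide the logits by the temperature, softmax, and draw one token id. A minimal sketch (the logits here are made up for illustration):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Scale logits by 1/temperature, softmax, then sample one token id."""
    if rng is None:
        rng = np.random.default_rng()
    if temperature == 0:               # greedy decoding as the limit case
        return int(np.argmax(logits))
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                       # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]
token = sample_next_token(logits, temperature=0)  # greedy: always token 0
```

Low temperatures sharpen the distribution toward the argmax; high temperatures flatten it toward uniform, trading coherence for diversity.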

Scaling & Alternatives

⏱️

Lesson 04

Limits of Serial Decoding

Why autoregressive generation is slow despite fast hardware. Memory bandwidth, batching, and speculative decoding.

Memory Bandwidth · Batching · Speculative
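The speculative idea can be sketched with toy deterministic "models": a cheap draft model proposes several tokens, the target model checks them, and the longest agreeing prefix is accepted in one go. (Real speculative decoding verifies all draft positions in a single target forward pass and uses a probabilistic accept/reject rule; the greedy-agreement version below is a simplification, and both model functions are made-up stand-ins.)

```python
def draft_next(ctx):
    """Toy draft model: always predicts last token + 1 (mod 10)."""
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    """Toy target model: agrees with the draft for short contexts, then diverges."""
    return (ctx[-1] + 1) % 10 if len(ctx) < 3 else 0

def speculative_step(context, k):
    """Draft k tokens cheaply, then keep the longest prefix the target agrees with."""
    draft, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in draft:
        if target_next(ctx) == t:   # target confirms the drafted token
            accepted.append(t)
            ctx.append(t)
        else:
            break                   # first disagreement ends the accepted run
    return accepted

result = speculative_step([3], k=4)  # draft proposes [4, 5, 6, 7]
```

When the draft is usually right, several tokens are committed per expensive target pass, which is the win: latency is dominated by memory-bandwidth-bound weight reads, not arithmetic.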
🌀

Lesson 05

Diffusion Language Models

Generating all tokens in parallel using masked diffusion. Why this maps perfectly to GPU hardware and what it trades away.

MDLM · Parallel · GPU
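The decoding loop itself can be sketched without a trained model: start from an all-masked sequence, score every position in parallel, and unmask a slice of the most confident positions each step until none remain. The "denoiser" below is a random stand-in, not a real MDLM:

```python
import numpy as np

MASK = -1  # sentinel for a still-masked position

def toy_denoiser(tokens, rng, vocab_size=100):
    """Stand-in for the model: a predicted token and a confidence per position."""
    preds = rng.integers(0, vocab_size, size=len(tokens))
    conf = rng.random(len(tokens))
    return preds, conf

def diffusion_decode(length, steps, rng):
    tokens = np.full(length, MASK)
    for step in range(steps):
        preds, conf = toy_denoiser(tokens, rng)   # all positions scored at once
        masked = np.flatnonzero(tokens == MASK)
        # Unmask an even share of the remaining masked positions each step,
        # picking the ones the model is most confident about.
        k = int(np.ceil(len(masked) / (steps - step)))
        keep = masked[np.argsort(-conf[masked])[:k]]
        tokens[keep] = preds[keep]
    return tokens

rng = np.random.default_rng(0)
out = diffusion_decode(length=16, steps=4, rng=rng)
```

Every step is one batched forward pass over the whole sequence, which is why this decoding pattern saturates GPUs so well; the tradeoff is that tokens fixed early cannot condition on tokens revealed later.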