From the attention mechanism through autoregressive transformers to diffusion language models — built from first principles with interactive simulations.
Foundations
How text becomes integers. BPE merge steps, live tokeniser demo, vocabulary tradeoffs, and special tokens.
BPE · Vocabulary · Subword · Lesson 01
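The merge loop at the heart of BPE can be sketched in a few lines. This is a toy illustration, not the lesson's implementation: one step counts every adjacent symbol pair across the corpus and fuses the most frequent pair into a new vocabulary symbol.

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE merge: find the most frequent adjacent symbol pair
    across all words and fuse it into a single new symbol."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        return words, None
    best = max(pairs, key=pairs.get)  # most frequent pair (first wins ties)
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged, best

corpus = [list("lower"), list("lowest"), list("low")]
corpus, pair = bpe_merge_step(corpus)
print(pair)    # ('l', 'o') — appears in all three words
print(corpus)  # e.g. ['lo', 'w'] for the last word
```

Repeating this step until the vocabulary reaches a target size is the tradeoff the lesson explores: more merges mean longer subwords but a bigger vocabulary.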
Queries, keys, and values. How dot-product attention lets every token communicate with every other token, and why it works.
QKV · Softmax · Multi-Head · Lesson 02
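The core operation the lesson builds up to fits in a few lines of NumPy: scaled dot-product attention, softmax(QK^T / √d)V. A minimal single-head sketch (the random Q, K, V here are placeholders for learned projections):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each query scores every key,
    softmax turns scores into weights, weights mix the values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
# each row of w is a probability distribution over the 4 tokens,
# so every token's output is a weighted mix of all values
```

Multi-head attention simply runs several of these in parallel on lower-dimensional projections and concatenates the results.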
Layer norm, feed-forward layers, residual connections, and the causal mask. How transformer blocks stack into a full model.
Architecture · Residuals · Causal Mask · Lesson 03
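How those pieces fit together can be sketched as one pre-norm block: layer norm, masked attention with a residual connection, then a feed-forward layer with another residual. This is an illustrative toy (single head, random weights standing in for learned ones), not the lesson's exact architecture:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def causal_mask(n):
    # -inf above the diagonal: a token cannot attend to the future
    return np.triu(np.full((n, n), -np.inf), k=1)

def block(x, Wq, Wk, Wv, Wo, W1, W2):
    """Pre-norm transformer block: masked attention + FFN, each residual."""
    n, d = x.shape
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(d) + causal_mask(n)
    scores -= scores.max(-1, keepdims=True)
    w = np.exp(scores); w /= w.sum(-1, keepdims=True)
    x = x + (w @ v) @ Wo                  # residual around attention
    h = layer_norm(x)
    x = x + np.maximum(h @ W1, 0) @ W2    # residual around ReLU FFN
    return x

rng = np.random.default_rng(1)
d, n = 8, 5
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
W1 = rng.normal(size=(d, 4 * d)) * 0.1
W2 = rng.normal(size=(4 * d, d)) * 0.1
x = rng.normal(size=(n, d))
y = block(x, Wq, Wk, Wv, Wo, W1, W2)  # same shape in, same shape out
```

Because input and output shapes match, blocks like this stack directly into a deep model; the causal mask guarantees position 0's output never depends on later tokens.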
Next-token prediction, temperature sampling, KV cache, and context windows. How a trained transformer actually generates text.
Sampling · KV Cache · Context · Lesson 04
Scaling & Alternatives
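Temperature sampling, the first idea in that list, is compact enough to sketch here. Dividing the logits by a temperature before the softmax sharpens (low T) or flattens (high T) the next-token distribution; the logits below are made up for illustration:

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Sample a token id from logits; lower temperature sharpens
    the distribution toward the argmax, higher flattens it."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                  # numerical stability
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

logits = [3.0, 1.0, 0.5]
print(sample_token(logits, temperature=0.05))  # almost surely 0 (the argmax)
print(sample_token(logits, temperature=5.0))   # any of 0, 1, 2 — near uniform
```

At temperature → 0 this degenerates to greedy decoding; the KV cache and context window then determine how cheaply each such step can be computed.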
Why autoregressive generation is slow despite fast hardware. Memory bandwidth, batching, and speculative decoding.
Memory Bandwidth · Batching · Speculative · Lesson 05
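The memory-bandwidth argument reduces to one line of arithmetic: at batch size 1, every generated token must stream all model weights from memory once, so the decode rate is capped by bandwidth divided by model size. The numbers below are illustrative assumptions, not measurements:

```python
def decode_tokens_per_second(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on batch-1 decode speed: each token reads every
    weight once, so tokens/s <= bandwidth / model bytes."""
    return bandwidth_bytes_per_s / (n_params * bytes_per_param)

# Assumed figures: a 7B-parameter model in fp16 (2 bytes/param)
# on an accelerator with ~1 TB/s of memory bandwidth.
print(decode_tokens_per_second(7e9, 2, 1e12))  # ~71 tokens/s, regardless of FLOPs
```

This is why batching helps (many sequences share one weight read) and why speculative decoding helps (a cheap draft model proposes several tokens that the large model verifies in a single pass).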
Generating all tokens in parallel using masked diffusion. Why this maps perfectly to GPU hardware and what it trades away.
MDLM · Parallel · GPU
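The parallel-generation loop can be sketched abstractly: start from an all-mask sequence, run one forward pass that predicts every masked position at once, then commit only the most confident fraction and repeat. The `toy_predict` below is a hypothetical stand-in for a trained denoiser, used only to make the loop runnable:

```python
import numpy as np

MASK_ID = -1  # sentinel for a masked position

def diffusion_step(tokens, predict, unmask_frac):
    """One reverse-diffusion step: predict all masked positions in
    parallel, commit only the most confident fraction of them."""
    masked = np.flatnonzero(tokens == MASK_ID)
    if masked.size == 0:
        return tokens
    ids, conf = predict(tokens)          # ONE forward pass for every position
    k = max(1, int(round(unmask_frac * masked.size)))
    commit = masked[np.argsort(conf[masked])[::-1][:k]]
    out = tokens.copy()
    out[commit] = ids[commit]
    return out

def toy_predict(tokens):
    # hypothetical denoiser: predicts token i at position i,
    # with confidence decreasing left to right
    n = tokens.size
    return np.arange(n), np.linspace(1.0, 0.0, n)

tokens = np.full(8, MASK_ID)
while (tokens == MASK_ID).any():
    tokens = diffusion_step(tokens, toy_predict, unmask_frac=0.25)
# a handful of steps fills all 8 positions — far fewer than 8 sequential
# autoregressive steps, and each step is a single dense, GPU-friendly pass
```

The tradeoff the blurb alludes to: each step sees the whole sequence, so the model loses the exact left-to-right factorization of autoregressive decoding in exchange for parallelism.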