
Tokenization

Text can't be fed directly into a neural network. Tokenization converts raw text into a sequence of integers that index into a learned embedding table — defining the vocabulary the model operates on.

Text → tokens → IDs → embeddings

Before a model processes text, the text passes through four stages: the raw string is split into tokens (subword pieces), each token is looked up in a vocabulary to get an integer ID, and each ID is mapped to a dense vector (an embedding) via a learned table.

The four-stage pipeline from raw text to neural network input
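The four stages can be sketched with a toy five-word vocabulary and randomly initialised embeddings (in a real model both the vocabulary and the table are learned; the words and dimension here are illustrative):

```python
import random

# Stage 1 → 2: split raw text into tokens (word-level here for simplicity;
# real tokenisers split at the subword level)
text = "the cat sat"
tokens = text.split()

# Stage 2 → 3: look each token up in a vocabulary to get integer IDs
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
ids = [vocab[t] for t in tokens]

# Stage 3 → 4: index an embedding table to get dense vectors
# (random stand-ins here; a real table is trained with the model)
dim = 8
random.seed(0)
embedding_table = [[random.gauss(0, 1) for _ in range(dim)] for _ in vocab]
embeddings = [embedding_table[i] for i in ids]

print(tokens)  # ['the', 'cat', 'sat']
print(ids)     # [0, 1, 2]
print(len(embeddings), len(embeddings[0]))  # 3 vectors of dimension 8
```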
Why not words? A word-level vocabulary must include every word the model will ever see (hundreds of thousands of entries), and it still fails on typos, new words, and code. Character-level tokenisation avoids this but produces very long sequences with no semantic units. Subword tokenisation (BPE, WordPiece, SentencePiece) is the middle ground: common words are single tokens, while rare words are split into recognisable pieces.

Live tokeniser

Type or edit text below to see it tokenised in real time. This uses a word-level approximation — real BPE tokenisers (like GPT-4's tiktoken) split at the subword level, but the structure is identical.

Tokens — each chip is one token, number below is its ID
Tokenisation quirks to try:
  • Numbers: "2024" — often split into "20" + "24" or individual digits
  • Rare words: "tokenisation" vs "tokenization" — different splits by region
  • Code: function_name — underscores split word pieces
  • Spaces: GPT tokenisers attach the leading space to the following word, so " the" and "the" are different tokens (SentencePiece marks the space explicitly, as "▁the")
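A word-level approximation like the one in the demo can be sketched in a few lines: a regex split plus a dictionary that assigns a fresh ID to each token on first sight (the class and regex are assumptions for illustration; the widget's actual code isn't shown):

```python
import re

class WordTokeniser:
    """Toy word-level tokeniser: assigns a new ID to each unseen token."""
    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Keep the leading space attached to the following word, mimicking
        # GPT-style tokenisers (" the" and "the" get different IDs).
        tokens = re.findall(r" ?\w+| ?[^\w\s]", text)
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in tokens]

tok = WordTokeniser()
print(tok.encode("the cat sat on the mat."))  # → [0, 1, 2, 3, 4, 5, 6]
# Note "the" (id 0) and " the" (id 4) are distinct — the space quirk above.
```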

Byte-Pair Encoding — step by step

BPE builds a vocabulary by iteratively merging the most frequent adjacent pair of tokens. Starting from individual characters, each merge step creates a new subword unit. After enough merges the vocabulary covers common words as single tokens.

Corpus used below: "low" (×5), "lower" (×2), "newest" (×6), "wider" (×3). The algorithm finds the most-frequent adjacent pair and merges it everywhere.

Result after all merges: "low" = one token, "lower" = low + er, "newest" = new + est, "wider" = w + id + er. The vocabulary has grown from 11 individual characters to include 7 merged subwords — and common words are now single tokens.
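The merge loop can be reproduced with a short Sennrich-style sketch. Note the exact subwords that emerge depend on tie-breaking between equally frequent pairs and on whether an end-of-word marker is used, so the intermediate merges here may differ from the demo above, even on the same corpus:

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words start as space-separated characters, weighted by corpus frequency
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e r": 3}
merges = []
for _ in range(7):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    merges.append(best)

print(merges[0])  # ('w', 'e') — the most frequent pair (8 occurrences)
print(vocab)      # after 7 merges, "low" and "newest" are single tokens
```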

Vocabulary size in practice

The vocabulary size controls the tradeoff between token sequence length and embedding table size. Larger vocabularies mean fewer tokens per sentence (faster generation) but a larger embedding table (more memory).

Model           | Tokeniser              | Vocab size | Avg tokens / word
GPT-2           | BPE                    | 50,257     | ~1.3
GPT-3.5 / GPT-4 | tiktoken (cl100k_base) | 100,277    | ~1.2
Llama 3         | tiktoken (BPE)         | 128,256    | ~1.1
BERT            | WordPiece              | 30,522     | ~1.4
T5              | SentencePiece          | 32,100     | ~1.4
Why GPT-4 doubled the vocabulary. A larger vocabulary means fewer tokens per document — which directly reduces the number of forward passes at inference time. For long-context tasks this has a measurable speed and cost impact. The downside is a proportionally larger embedding matrix.
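The memory cost is easy to put numbers on. A back-of-the-envelope sketch (the 4,096 embedding dimension and 2-byte bf16 storage are illustrative assumptions, not published figures for any of these models):

```python
def embedding_table_bytes(vocab_size, dim, bytes_per_param=2):
    """Size of the input embedding matrix alone (bf16/fp16 = 2 bytes)."""
    return vocab_size * dim * bytes_per_param

for name, vocab in [("GPT-2", 50_257), ("cl100k", 100_277), ("Llama 3", 128_256)]:
    gib = embedding_table_bytes(vocab, 4096) / 2**30
    print(f"{name:8s} {vocab:>7,} tokens x 4096 dims -> {gib:.2f} GiB")
```

Doubling the vocabulary roughly doubles this matrix (and the output projection, if untied), in exchange for fewer forward passes per document.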

Special tokens

Every tokeniser reserves a set of special tokens that are never split. They signal boundaries, padding positions, and task structure to the model.

Token        | Name                  | Used for
<|bos|>      | Beginning of sequence | Marks the start of a generation context
<|eos|>      | End of sequence       | Signals the model to stop generating
[PAD]        | Padding               | Fills short sequences in a batch to equal length
[MASK]       | Mask                  | Corrupted position in masked-LM training (BERT, diffusion)
[SEP]        | Separator             | Divides two segments (e.g. question + passage in BERT)
[UNK]        | Unknown               | Any character sequence outside the vocabulary
<|im_start|> | Chat turn start       | Marks the beginning of a user/assistant turn (ChatML format)
Tokens shape model behaviour. The <|eos|> token is what actually stops generation — the model learns to predict it as the final token. Without it, the model would generate indefinitely. Chat models are fine-tuned to produce <|eos|> at the end of each assistant turn.
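The stopping mechanism can be sketched as a plain generation loop. The `next_token` stub and `eos_id` value below are hypothetical; a real model would return a token sampled from its output distribution:

```python
def generate(next_token, prompt_ids, eos_id, max_new_tokens=32):
    """Append tokens until the model emits eos_id or the budget runs out."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = next_token(ids)
        if tok == eos_id:
            break  # <|eos|> is what actually stops generation
        ids.append(tok)
    return ids

# Stub "model": emits 7, 8, 9, then the eos token (id 2)
script = iter([7, 8, 9, 2])
out = generate(lambda ids: next(script), prompt_ids=[5, 6], eos_id=2)
print(out)  # [5, 6, 7, 8, 9] — stopped at eos, which is not appended
```

Without the `eos_id` check, the loop would always run to `max_new_tokens`, which is exactly the "generates indefinitely" failure mode described above.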