
Tokenization

Text can't be fed directly into a neural network. Tokenization converts raw text into a sequence of integers that index into a learned embedding table — defining the vocabulary the model operates on.

Text → tokens → IDs → embeddings

Before a model processes text, the text passes through four stages: the raw string is split into tokens (subword pieces), each token is looked up in a vocabulary to get an integer ID, and each ID is mapped to a dense vector (an embedding) via a learned table.

The four-stage pipeline from raw text to neural network input
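The four stages can be sketched with a toy five-word vocabulary and randomly initialised embeddings (in a real model both the vocabulary and the table are learned; the words and dimension here are illustrative):

```python
import random

# Stage 1 → 2: split raw text into tokens (word-level here for simplicity;
# real tokenisers split at the subword level)
text = "the cat sat"
tokens = text.split()

# Stage 2 → 3: look each token up in a vocabulary to get integer IDs
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
ids = [vocab[t] for t in tokens]

# Stage 3 → 4: index an embedding table to get dense vectors
# (random stand-ins here; a real table is trained with the model)
dim = 8
random.seed(0)
embedding_table = [[random.gauss(0, 1) for _ in range(dim)] for _ in vocab]
embeddings = [embedding_table[i] for i in ids]

print(tokens)  # ['the', 'cat', 'sat']
print(ids)     # [0, 1, 2]
print(len(embeddings), len(embeddings[0]))  # 3 vectors of dimension 8
```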
Why not words? A word-level vocabulary must include every word the model will ever see (hundreds of thousands of entries), and it still fails on typos, new words, and code. Character-level tokenisation avoids this but produces very long sequences with no semantic units. Subword tokenisation (BPE, WordPiece, SentencePiece) is the middle ground: common words are single tokens, while rare words are split into recognisable pieces.

Live tokeniser

Type or edit text below to see it tokenised in real time. This uses a word-level approximation — real BPE tokenisers (like GPT-4's tiktoken) split at the subword level, but the structure is identical.

Tokens — each chip is one token, number below is its ID
Tokenisation quirks to try:
  • Numbers: "2024" — often split into "20" + "24" or individual digits
  • Rare words: "tokenisation" vs "tokenization" — different splits by region
  • Code: function_name — underscores split word pieces
  • Spaces: GPT tokenisers attach the leading space to the following word, so " the" and "the" are different tokens (SentencePiece marks the space explicitly, as "▁the")
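A word-level approximation like the one in the demo can be sketched in a few lines: a regex split plus a dictionary that assigns a fresh ID to each token on first sight (the class and regex are assumptions for illustration; the widget's actual code isn't shown):

```python
import re

class WordTokeniser:
    """Toy word-level tokeniser: assigns a new ID to each unseen token."""
    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Keep the leading space attached to the following word, mimicking
        # GPT-style tokenisers (" the" and "the" get different IDs).
        tokens = re.findall(r" ?\w+| ?[^\w\s]", text)
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in tokens]

tok = WordTokeniser()
print(tok.encode("the cat sat on the mat."))  # → [0, 1, 2, 3, 4, 5, 6]
# Note "the" (id 0) and " the" (id 4) are distinct — the space quirk above.
```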

Byte-Pair Encoding — step by step

BPE builds a vocabulary by iteratively merging the most frequent adjacent pair of tokens. Starting from individual characters, each merge step creates a new subword unit. After enough merges the vocabulary covers common words as single tokens.

Corpus used below: "low" (×5), "lower" (×2), "newest" (×6), "wider" (×3). The algorithm finds the most-frequent adjacent pair and merges it everywhere.

Result after all merges: "low" = one token, "lower" = low + er, "newest" = new + est, "wider" = w + id + er. The vocabulary has grown from 11 individual characters to include 7 merged subwords — and common words are now single tokens.
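The merge loop can be reproduced with a short Sennrich-style sketch. Note the exact subwords that emerge depend on tie-breaking between equally frequent pairs and on whether an end-of-word marker is used, so the intermediate merges here may differ from the demo above, even on the same corpus:

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words start as space-separated characters, weighted by corpus frequency
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e r": 3}
merges = []
for _ in range(7):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    merges.append(best)

print(merges[0])  # ('w', 'e') — the most frequent pair (8 occurrences)
print(vocab)      # after 7 merges, "low" and "newest" are single tokens
```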

Vocabulary size in practice

The vocabulary size controls the tradeoff between token sequence length and embedding table size. Larger vocabularies mean fewer tokens per sentence (faster generation) but a larger embedding table (more memory).

Model           | Tokeniser              | Vocab size | Avg tokens / word
GPT-2           | BPE                    | 50,257     | ~1.3
GPT-3.5 / GPT-4 | tiktoken (cl100k_base) | 100,277    | ~1.2
Llama 3         | tiktoken (BPE)         | 128,256    | ~1.1
BERT            | WordPiece              | 30,522     | ~1.4
T5              | SentencePiece          | 32,100     | ~1.4
Why GPT-4 doubled the vocabulary. A larger vocabulary means fewer tokens per document — which directly reduces the number of forward passes at inference time. For long-context tasks this has a measurable speed and cost impact. The downside is a proportionally larger embedding matrix.
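The memory cost is easy to put numbers on. A back-of-the-envelope sketch (the 4,096 embedding dimension and 2-byte bf16 storage are illustrative assumptions, not published figures for any of these models):

```python
def embedding_table_bytes(vocab_size, dim, bytes_per_param=2):
    """Size of the input embedding matrix alone (bf16/fp16 = 2 bytes)."""
    return vocab_size * dim * bytes_per_param

for name, vocab in [("GPT-2", 50_257), ("cl100k", 100_277), ("Llama 3", 128_256)]:
    gib = embedding_table_bytes(vocab, 4096) / 2**30
    print(f"{name:8s} {vocab:>7,} tokens x 4096 dims -> {gib:.2f} GiB")
```

Doubling the vocabulary roughly doubles this matrix (and the output projection, if untied), in exchange for fewer forward passes per document.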

Special tokens

Every tokeniser reserves a set of special tokens that are never split. They signal boundaries, padding positions, and task structure to the model.

Token        | Name                  | Used for
<|bos|>      | Beginning of sequence | Marks the start of a generation context
<|eos|>      | End of sequence       | Signals the model to stop generating
[PAD]        | Padding               | Fills short sequences in a batch to equal length
[MASK]       | Mask                  | Corrupted position in masked-LM training (BERT, diffusion)
[SEP]        | Separator             | Divides two segments (e.g. question + passage in BERT)
[UNK]        | Unknown               | Any character sequence outside the vocabulary
<|im_start|> | Chat turn start       | Marks the beginning of a user/assistant turn (ChatML format)
Tokens shape model behaviour. The <|eos|> token is what actually stops generation — the model learns to predict it as the final token. Without it, the model would generate indefinitely. Chat models are fine-tuned to produce <|eos|> at the end of each assistant turn.
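The stopping mechanism can be sketched as a plain generation loop. The `next_token` stub and `eos_id` value below are hypothetical; a real model would return a token sampled from its output distribution:

```python
def generate(next_token, prompt_ids, eos_id, max_new_tokens=32):
    """Append tokens until the model emits eos_id or the budget runs out."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = next_token(ids)
        if tok == eos_id:
            break  # <|eos|> is what actually stops generation
        ids.append(tok)
    return ids

# Stub "model": emits 7, 8, 9, then the eos token (id 2)
script = iter([7, 8, 9, 2])
out = generate(lambda ids: next(script), prompt_ids=[5, 6], eos_id=2)
print(out)  # [5, 6, 7, 8, 9] — stopped at eos, which is not appended
```

Without the `eos_id` check, the loop would always run to `max_new_tokens`, which is exactly the "generates indefinitely" failure mode described above.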