Text can't be fed directly into a neural network. Tokenization converts raw text into a sequence of integers that index into a learned embedding table — defining the vocabulary the model operates on.
Before a model processes text, it passes through three stages: the raw string is split into tokens (sub-word pieces); each token is looked up in a vocabulary to obtain an integer ID; and each ID is mapped to a dense vector (embedding) via a learned table.
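The three stages can be sketched with a toy vocabulary and a random embedding table (real models learn both; the words and the 4-dimensional embeddings here are illustrative):

```python
import numpy as np

# Toy vocabulary and embedding table — illustrative, not from any real model.
vocab = {"the": 0, "cat": 1, "sat": 2, "[UNK]": 3}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))  # one 4-dim vector per ID

def tokenize(text):
    # Stage 1: split the raw string into tokens (word-level here for simplicity)
    return text.lower().split()

def encode(tokens):
    # Stage 2: look each token up in the vocabulary to get an integer ID
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]

def embed(ids):
    # Stage 3: index the learned table to get one dense vector per ID
    return embedding_table[ids]

ids = encode(tokenize("The cat sat"))
print(ids)                 # [0, 1, 2]
vectors = embed(ids)
print(vectors.shape)       # (3, 4)
```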
For illustration, a word-level approximation is enough — real BPE tokenisers (like GPT-4's tiktoken) split at the subword level, but the pipeline is identical.
BPE builds a vocabulary by iteratively merging the most frequent adjacent pair of tokens. Starting from individual characters, each merge step creates a new subword unit; after enough merges, the vocabulary covers common words as single tokens.
Corpus used below: "low" (×5), "lower" (×2), "newest" (×6), "wider" (×3). The algorithm finds the most-frequent adjacent pair and merges it everywhere.
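The merge loop can be sketched in a few lines. Following the original BPE paper's convention, each word is a tuple of characters ending in an end-of-word marker `</w>`; the corpus frequencies are the ones above:

```python
from collections import Counter

# Corpus from the text: word -> frequency, words as character tuples with
# an end-of-word marker (the convention from the original BPE paper).
corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "r", "</w>"): 3,
}

def most_frequent_pair(corpus):
    # Count every adjacent symbol pair, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge(corpus, pair)
print(merges)
```

On this corpus the first merge is `("w", "e")` (frequency 2 + 6 = 8, from "lower" and "newest"), then `("l", "o")` (frequency 7); each merged pair becomes a new vocabulary entry.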
The vocabulary size controls the tradeoff between token sequence length and embedding table size. Larger vocabularies mean fewer tokens per sentence (faster generation) but a larger embedding table (more memory).
| Model | Tokeniser | Vocab size | Avg tokens / word |
|---|---|---|---|
| GPT-2 | BPE | 50,257 | ~1.3 |
| GPT-3.5 / GPT-4 | tiktoken (cl100k_base) | 100,277 | ~1.2 |
| Llama 3 | tiktoken | 128,256 | ~1.1 |
| BERT | WordPiece | 30,522 | ~1.4 |
| T5 | SentencePiece | 32,100 | ~1.4 |
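The memory side of the tradeoff is easy to quantify. A back-of-envelope sketch, using approximate vocabulary and embedding sizes in the style of the models above (fp16 parameters assumed):

```python
# Embedding table memory = vocab_size x d_model x bytes per parameter.
# Sizes below are illustrative, loosely based on GPT-2 and Llama 3.
def embedding_table_mb(vocab_size, d_model, bytes_per_param=2):  # fp16
    return vocab_size * d_model * bytes_per_param / 2**20

gpt2_mb = embedding_table_mb(50_257, 768)      # GPT-2-scale: ~74 MB
llama3_mb = embedding_table_mb(128_256, 4096)  # Llama-3-scale: ~1 GB
print(f"{gpt2_mb:.1f} MB, {llama3_mb:.1f} MB")
```

A 2.5× larger vocabulary at a 5× wider embedding dimension costs roughly 13× the memory in the table alone, which is why vocabulary size is chosen jointly with model width.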
Every tokeniser reserves a set of special tokens that are never split. They signal boundaries, padding positions, and task structure to the model.
| Token | Name | Used for |
|---|---|---|
| <|bos|> | Beginning of sequence | Marks the start of a generation context |
| <|eos|> | End of sequence | Signals the model to stop generating |
| [PAD] | Padding | Fills short sequences in a batch to equal length |
| [MASK] | Mask | Corrupted position in masked LM training (BERT, diffusion) |
| [SEP] | Separator | Divides two segments (e.g. question + passage in BERT) |
| [UNK] | Unknown | Any character sequence outside the vocabulary |
| <|im_start|> | Chat turn start | Marks the beginning of a user/assistant turn (ChatML format) |
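Special tokens are how chat structure reaches the model as plain text. A minimal sketch of assembling a ChatML-style prompt, assuming `<|im_end|>` as the counterpart of `<|im_start|>` (the exact spellings and template vary by model):

```python
# ChatML-style prompt assembly — a sketch; real chat templates vary by model.
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def chatml(messages):
    # Each turn: <|im_start|>{role}\n{content}<|im_end|>
    parts = [f"{IM_START}{role}\n{content}{IM_END}" for role, content in messages]
    return "\n".join(parts)

prompt = chatml([
    ("system", "You are a helpful assistant."),
    ("user", "What is tokenization?"),
])
print(prompt)
```

Because these markers are single, never-split tokens, the model can reliably learn that text after `<|im_start|>user` is the user's turn and that emitting `<|im_end|>` closes a turn.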