Subword Tokenization

A neural network does not read text — it reads numbers. So before any attention can happen, a string of sequence data must become a sequence of integer tokens, each an index into a fixed vocabulary. The only question is: what counts as one token? Get this wrong and everything downstream pays for it.

Two tempting extremes, both wrong

Word-level. Make each whole word a token. Clean and intuitive — until you meet the real world. Human language has a heavy tail: there are hundreds of thousands of words, names, typos, and compounds, so the vocabulary balloons. Worse, any word you didn't see at training time — "antidisestablishmentarianism", a new product name, a misspelling — has no index at all. It becomes the dreaded out-of-vocabulary token \langle\text{UNK}\rangle, and its meaning is simply lost.

Character-level. Swing the other way: make each character a token. Now the vocabulary is tiny (a few hundred symbols) and nothing is ever out-of-vocabulary — every string is spellable. But the price is brutal length: a five-word sentence becomes thirty-plus tokens, and since attention costs grow with sequence length, you have made every sentence far more expensive to process while forcing the model to relearn that c-a-t spells a familiar idea.

Subword. The sweet spot keeps the best of both: frequent words stay whole (so sequences stay short and common meanings are one token), while rare words split into reusable pieces (so nothing is ever truly out-of-vocabulary). "tokenization" might become token + ization; an unseen word still decomposes into known fragments. This is what every modern transformer actually uses — under names like BPE, WordPiece, and SentencePiece.

Byte-Pair Encoding, line by line

The canonical subword algorithm is Byte-Pair Encoding (BPE). It is almost absurdly simple: start from characters, then greedily glue together the pair that occurs most often, over and over, growing the vocabulary one merge at a time. Take a tiny corpus of four words with their counts:

\texttt{low}\times 5,\quad \texttt{lower}\times 2,\quad \texttt{newest}\times 6,\quad \texttt{widest}\times 3.

Step 0 — start from characters. Every word is split into its characters, and the vocabulary is just the set of distinct characters seen:

\texttt{l o w},\quad \texttt{l o w e r},\quad \texttt{n e w e s t},\quad \texttt{w i d e s t}.

Step 1 — count every adjacent pair. Scan all words, weighting each pair by its word's count. The pair \texttt{e\,s} appears in \texttt{newest}\,(6) and \texttt{widest}\,(3), a total of 9 — more than any other pair. Merge it into the new token \texttt{es}:

\texttt{n e w es t},\qquad \texttt{w i d es t}.

Step 2 — recount, merge again. Now the pair \texttt{es\,t} appears 9 times (still both words). Merge it into \texttt{est}:

\texttt{n e w est},\qquad \texttt{w i d est}.

Step 3 — and again. The pair \texttt{l\,o} appears in \texttt{low}\,(5) and \texttt{lower}\,(2), a total of 7 — now the most frequent. Merge into \texttt{lo}:

\texttt{lo w},\qquad \texttt{lo w e r}.

Step 4 — stop at the target vocabulary size. Keep merging until the vocabulary reaches a chosen budget (say 30{,}000 tokens). The learned ordered list of merges is the tokenizer: to tokenize new text, you re-apply those same merges in the same order. Common sequences like \texttt{est} have become single tokens; rare words gracefully fall back to smaller pieces.

Notice the rule never had to know what a "word" is. It just counted adjacency and merged the winner — yet out fell a vocabulary in which the frequent stays whole and the rare stays spellable.

Text is encoded as a sequence of integer tokens drawn from a fixed vocabulary. The granularity is a trade-off:

The trade-off. Word tokens give short sequences but an enormous vocabulary and out-of-vocabulary failures; character tokens give a tiny vocabulary and zero out-of-vocabulary but very long sequences. Subword tokens sit between: frequent words stay whole, rare words split into reusable pieces.
The BPE merge rule. Start from characters; repeatedly find the most frequent adjacent pair and merge it into a new token, until the vocabulary reaches a target size. The ordered list of merges is the tokenizer.
Coverage. Because every word ultimately decomposes to characters, a subword tokenizer can represent any string — there is no out-of-vocabulary token.

Tokens are not words, and they are emphatically not letters. The string "tokenizer" may well arrive at the model as three tokens — token, iz, er — and the word "strawberry" as two, say straw and berry. The model never sees the individual characters; it sees opaque integer ids. So when you ask "how many r's are in strawberry?", the model has no direct view of the letters at all — it must reason about the spelling of chunks it only knows by index. This is also why token counts (and billing) rarely match word counts: a sentence of ten words can be eight tokens or eighteen, depending entirely on how the merges happened to chop it up.

Watch a word collapse into subword tokens

Here is the word "newest" as a row of cells, one per current token. Step forward to apply each BPE merge in turn: two adjacent cells fuse into a single wider token. The character soup n e w e s t collapses, merge by merge, toward the compact subword tokens a real tokenizer would emit.