A neural network does not read text — it reads numbers. So before any
Word-level. Make each whole word a token. Clean and intuitive — until you meet
the real world. Human language has a heavy tail: there are hundreds of thousands of words, names,
typos, and compounds, so the vocabulary balloons. Worse, any word you didn't see at training time
— "antidisestablishmentarianism", a new product name, a misspelling — has no index at
all. It becomes the dreaded out-of-vocabulary token
Character-level. Swing the other way: make each character a token. Now the
vocabulary is tiny (a few hundred symbols) and nothing is ever out-of-vocabulary — every
string is spellable. But the price is brutal length: a five-word sentence becomes thirty-plus
tokens, and since attention costs grow with sequence length, you have made every sentence far more
expensive to process while forcing the model to relearn that c-a-t spells a familiar
idea.
Subword. The sweet spot keeps the best of both: frequent words stay whole (so
sequences stay short and common meanings are one token), while rare words split into reusable
pieces (so nothing is ever truly out-of-vocabulary). "tokenization" might become
token + ization; an unseen word still decomposes into known fragments.
This is what every modern transformer actually uses — under names like BPE,
WordPiece, and SentencePiece.
The canonical subword algorithm is Byte-Pair Encoding (BPE). It is almost absurdly simple: start from characters, then greedily glue together the pair that occurs most often, over and over, growing the vocabulary one merge at a time. Take a tiny corpus of four words with their counts:
Step 0 — start from characters. Every word is split into its characters, and the vocabulary is just the set of distinct characters seen:
Step 1 — count every adjacent pair. Scan all words, weighting each pair by its
word's count. The pair
Step 2 — recount, merge again. Now the pair
Step 3 — and again. The pair
Step 4 — stop at the target vocabulary size. Keep merging until the vocabulary
reaches a chosen budget (say
Notice the rule never had to know what a "word" is. It just counted adjacency and merged the winner — yet out fell a vocabulary in which the frequent stays whole and the rare stays spellable.
Tokens are not words, and they are emphatically not letters. The string
"tokenizer" may well arrive at the model as three tokens —
token, iz, er — and the word "strawberry" as
two, say straw and berry. The model never sees the individual
characters; it sees opaque integer ids. So when you ask "how many r's are in
strawberry?", the model has no direct view of the letters at all — it must reason about
the spelling of chunks it only knows by index. This is also why token counts (and
billing) rarely match word counts: a sentence of ten words can be eight tokens or eighteen,
depending entirely on how the merges happened to chop it up.
Here is the word "newest" as a row of cells, one per current token. Step forward to apply
each BPE merge in turn: two adjacent cells fuse into a single wider token. The character soup
n e w e s t collapses, merge by merge, toward the compact subword tokens a real
tokenizer would emit.