Strip away the hype and a language model is one humble promise: given the
text so far, it tells you how probable each possible next token is. Nothing more.
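Formally, for a vocabulary $V$ and a context $x_{<t} = (x_1, \dots, x_{t-1})$, the model outputs a distribution $p_\theta(x_t \mid x_{<t})$ over the tokens of $V$.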
From that one ability everything else follows — finishing your sentence, answering your question, writing the essay. The job of this page is to turn that promise into an objective a network can actually be trained on, line by line.
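As a minimal sketch of that promise, the snippet below turns a score per token into a next-token distribution with a softmax. The four-word vocabulary and the scores are made up for illustration; a real model would produce one score for every token in its vocabulary.

```python
import math

def softmax(logits):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and invented scores for the context "the cat sat on the".
vocab = ["mat", "floor", "sofa", "banana"]
logits = [3.0, 2.5, 1.0, -2.0]

probs = softmax(logits)
next_token_dist = dict(zip(vocab, probs))

for token, p in next_token_dist.items():
    print(f"{token:>7s}  {p:.3f}")
```

Higher scores get more probability mass, every token gets a strictly positive share, and the shares sum to one.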
We want the probability the model assigns to an entire sequence $x_1, \dots, x_T$.
Step 1 — factorise the sequence by the chain rule. The probability of a joint event is the product of conditionals, taken left to right. This is exact — pure probability, no modelling assumption yet:
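$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$
Read it aloud: the probability of the whole sentence is the probability of the first word,
times the probability of the second given the first, times the third given the
first two, and so on. A sequence model is therefore nothing but a machine for computing the
single factor $p(x_t \mid x_{<t})$, where $x_{<t}$ abbreviates the history $x_1, \dots, x_{t-1}$.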
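To check that the factorisation really is exact, the sketch below builds an arbitrary joint distribution over all length-3 sequences of two tokens, reads each conditional off the joint by marginalisation, and multiplies the conditionals back together. The tokens and weights are invented for the example.

```python
from itertools import product

# Hypothetical joint distribution over every length-3 sequence of two tokens.
tokens = ["a", "b"]
seqs = list(product(tokens, repeat=3))
weights = [1, 2, 3, 4, 5, 6, 7, 8]  # arbitrary positive weights
total = sum(weights)
joint = {s: w / total for s, w in zip(seqs, weights)}

def prefix_prob(prefix):
    """Marginal probability that a sequence starts with `prefix`."""
    n = len(prefix)
    return sum(p for s, p in joint.items() if s[:n] == prefix)

def conditional(token, history):
    """p(token | history), obtained from the joint by marginalisation."""
    return prefix_prob(history + (token,)) / prefix_prob(history)

def chain_rule_prob(seq):
    """Multiply the conditionals left to right."""
    prob = 1.0
    for t in range(len(seq)):
        prob *= conditional(seq[t], seq[:t])
    return prob

for s in seqs:
    print("".join(s), round(joint[s], 4), round(chain_rule_prob(s), 4))
```

The two printed columns agree for every sequence, whatever weights are chosen, because no modelling assumption has been made yet.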
Step 2 — let one network emit every factor at once. A decoder-only
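transformer with parameters $\theta$ and a causal attention mask reads the tokens $x_1, \dots, x_T$ and outputs, for every $t$, a softmax distribution $p_\theta(\cdot \mid x_{<t})$ over the token at position $t$.
Because the mask blocks the future, all $T$ factors are computed in a single forward pass.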
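The masking idea can be sketched without any neural network at all. The snippet below assumes a Boolean lower-triangular convention, where `mask[i][j]` says whether position `i` may look at position `j`. Real implementations commonly apply the same pattern by adding negative infinity to the blocked attention scores; the helper names here are hypothetical.

```python
def causal_mask(T):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(T)] for i in range(T)]

def visible_context(tokens, i, mask):
    """Tokens that position i is allowed to see under the mask."""
    return [tok for j, tok in enumerate(tokens) if mask[i][j]]

tokens = ["the", "cat", "sat", "on"]
mask = causal_mask(len(tokens))

for i in range(len(tokens)):
    row = "".join("1" if allowed else "0" for allowed in mask[i])
    print(row, visible_context(tokens, i, mask))
```

Each row sees only itself and what came before it, which is why every position can be evaluated in the same pass without leaking future tokens.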
Step 3 — score a corpus by log-likelihood. Take the logarithm of the chain
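rule (turning the product into a friendlier sum) and ask the parameters $\theta$ to make it as large as possible:
$$\log p_\theta(x_1, \dots, x_T) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$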
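One practical reason the sum is friendlier than the product: multiplying many small probabilities underflows in floating point, while adding their logarithms does not. A small sketch with made-up per-token probabilities:

```python
import math

# Invented per-token probabilities p(x_t | x_<t) for one 400-token sequence.
token_probs = [0.01] * 400

# Direct product: the true value is 1e-800, far below the smallest float64.
direct_product = 1.0
for p in token_probs:
    direct_product *= p

# Sum of logs: the same quantity on a log scale, comfortably representable.
log_likelihood = sum(math.log(p) for p in token_probs)

print(direct_product)
print(log_likelihood)
```

The product collapses to zero and carries no information, whereas the log-likelihood stays a usable finite number that can be compared across models.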
Step 4 — maximise log-likelihood = minimise average cross-entropy. Flip the
sign and divide by the length. Maximising the log-likelihood is identical to minimising the
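mean negative log-likelihood, which is precisely the average cross-entropy between the observed next tokens and the model's predictions:
$$\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$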
At each position the target is a one-hot vector on the actual next token, so its
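cross-entropy against the predicted distribution $p_\theta(\cdot \mid x_{<t})$ reduces to the single term $-\log p_\theta(x_t \mid x_{<t})$.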
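The equivalence can be checked numerically. The sketch below computes the full cross-entropy against one-hot targets at each position and compares its average with the mean negative log-probability of the correct tokens. The three-token vocabulary, the predicted distributions, and the function names are all invented for illustration.

```python
import math

def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target[i] * log(predicted[i])."""
    return -sum(q * math.log(p) for q, p in zip(target, predicted) if q > 0)

def mean_nll(predictions, target_ids):
    """Average of -log p(correct token) over all positions."""
    total = sum(math.log(p[i]) for p, i in zip(predictions, target_ids))
    return -total / len(target_ids)

# Invented predicted distributions at three positions, over a 3-token vocabulary.
predictions = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.25, 0.25, 0.5]]
target_ids = [0, 1, 2]  # index of the actual next token at each position

one_hots = [[1.0 if j == i else 0.0 for j in range(3)] for i in target_ids]
per_position = [cross_entropy(q, p) for q, p in zip(one_hots, predictions)]
loss = mean_nll(predictions, target_ids)

print(per_position)
print(sum(per_position) / len(per_position), loss)
```

Because a one-hot target zeroes out every term except the correct token's, the general cross-entropy formula and the simple negative log-probability give the same number.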
There is no extra magic above the next-token softmax. A language model is the autocomplete
on your phone keyboard, scaled up until the “next word” it suggests is good
enough to write code, summarise a report, or hold a conversation. To generate, you sample
one token from $p_\theta(\cdot \mid x_{<t})$, append it to the context, and repeat.
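The generation loop itself is only a few lines. In the sketch below the "model" is a hypothetical lookup table that conditions only on the last token, a deliberate simplification: a real language model conditions on the whole context. The table, the `<s>` and `</s>` markers, and the function names are assumptions made for the example.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the last token.
TABLE = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"</s>": 1.0},
}

def next_token_dist(context):
    """Stand-in for a model's forward pass: a distribution over next tokens."""
    return TABLE[context[-1]]

def generate(max_new_tokens, seed=0):
    """Sample a token, append it to the context, and repeat."""
    rng = random.Random(seed)
    context = ["<s>"]
    for _ in range(max_new_tokens):
        dist = next_token_dist(context)
        tokens, probs = zip(*dist.items())
        token = rng.choices(tokens, weights=probs, k=1)[0]
        context.append(token)
        if token == "</s>":  # stop once the end marker is sampled
            break
    return context

print(" ".join(generate(20)))
```

Swapping the table for a trained network changes the quality of `next_token_dist`, not the shape of the loop.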
The surprise of the last decade is how far that single objective stretches: predict the
next token well enough, over enough text, and grammar, facts, translation, and a sketch of
reasoning all emerge as side effects of getting the autocomplete right. We measure
“how good” with
For the context “the cat sat on the” a model spreads its probability
mass over the vocabulary — lots on mat and floor, a sliver on
banana. Each bar is one candidate next token's probability.