Language Modeling

Strip away the hype and a language model is one humble promise: given the text so far, it tells you how probable each possible next token is. Nothing more. Formally, for a vocabulary \mathcal{V} and a context x_{<t} = (x_1, \dots, x_{t-1}), the model outputs the conditional distribution

P(x_t \mid x_{<t}), \quad x_t \in \mathcal{V}.

From that one ability everything else follows — finishing your sentence, answering your question, writing the essay. The job of this page is to turn that promise into an objective a network can actually be trained on, line by line.

From a next-token guess to a whole-sequence objective

We want the probability the model assigns to an entire sequence x_1, \dots, x_T, and then a loss that rewards making real text probable. Four moves get us there.

Step 1 — factorise the sequence by the chain rule. The probability of a joint event is the product of conditionals, taken left to right. This is exact — pure probability, no modelling assumption yet:

P(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1}) = \prod_{t=1}^{T} P(x_t \mid x_{<t}).

Read it aloud: the probability of the whole sentence is the probability of the first word, times the probability of the second given the first, times the third given the first two, and so on. A sequence model is therefore nothing but a machine for the single factor P(x_t \mid x_{<t}): produce that well and the product takes care of itself.
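The chain rule can be sketched in a few lines of Python. The table `COND` and the helper `next_token_probs` are hypothetical stand-ins for a model. To keep the table small, the toy conditional looks only at the previous token; that is an illustrative simplification, and the chain rule itself conditions on the whole prefix.

```python
# Hypothetical toy "model": next-token probabilities keyed by the previous token.
# None marks the start of the sequence. A real language model would condition
# on the entire prefix, not just its last token.
COND = {
    None:  {"the": 0.6, "cat": 0.3, "sat": 0.1},
    "the": {"the": 0.1, "cat": 0.7, "sat": 0.2},
    "cat": {"the": 0.2, "cat": 0.1, "sat": 0.7},
    "sat": {"the": 0.5, "cat": 0.3, "sat": 0.2},
}

def next_token_probs(prefix):
    """Return the toy distribution over the next token given the prefix."""
    last = prefix[-1] if prefix else None
    return COND[last]

def sequence_prob(tokens):
    """Chain rule: multiply the conditional factors left to right."""
    prob = 1.0
    for t, token in enumerate(tokens):
        prob *= next_token_probs(tokens[:t])[token]
    return prob

# P(the) * P(cat | the) * P(sat | the, cat) = 0.6 * 0.7 * 0.7
p = sequence_prob(["the", "cat", "sat"])
```

Each pass through the loop contributes exactly one factor of the product, which is why a good next-token predictor is all the sequence probability needs.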

Step 2 — let one network emit every factor at once. A decoder-only transformer with causal masking guarantees that the logits used to predict x_t depend only on the earlier tokens x_{<t}, never on x_t itself or anything after it. For each target position t it produces a logit vector z_t \in \mathbb{R}^{|\mathcal{V}|}, and the softmax turns it into a genuine distribution over the vocabulary:

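P_\theta(x_t = w \mid x_{<t}) = \operatorname{softmax}(z_t)_w = \frac{\exp(z_{t,w})}{\sum_{w' \in \mathcal{V}} \exp(z_{t,w'})}.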

Because the mask blocks the future, all T of these next-token distributions are computed in a single forward pass — exactly the factors the chain rule asked for, one per position, in parallel.
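As a rough illustration of the two ingredients in this step, the sketch below implements a softmax over one logit vector and builds a lower-triangular causal mask. The three-word vocabulary and the logit values are made up for the example, and the mask is shown only as a table of booleans, not wired into an actual attention layer.

```python
import math

def softmax(logits):
    """Turn a list of real-valued logits into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(T):
    """mask[i][j] is True when input position i may read input position j (j <= i)."""
    return [[j <= i for j in range(T)] for i in range(T)]

vocab = ["mat", "floor", "banana"]        # hypothetical vocabulary
z_t = [2.0, 1.0, -1.0]                    # hypothetical logits for one position
p_t = softmax(z_t)
dist = dict(zip(vocab, p_t))              # word -> probability

mask = causal_mask(4)                     # 4 x 4 lower-triangular pattern
```

Row i of the mask has True only in columns 0 through i, so no position can read tokens that come after it; that is what lets every position be computed in the same pass without leaking the answer.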

Step 3 — score a corpus by log-likelihood. Take the logarithm of the chain rule (turning the product into a friendlier sum) and ask the parameters \theta to make the observed text as probable as possible:

\log P_\theta(x_1, \dots, x_T) = \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t}).
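A small numerical sketch of why the sum of logs is the friendlier form. The per-token probabilities are invented for illustration. The second half shows a practical side benefit that the text does not spell out: a long product of small probabilities underflows to zero in floating point, while the sum of logs stays finite.

```python
import math

# Hypothetical per-token probabilities for a 3-token sequence.
factors = [0.6, 0.7, 0.7]

product = math.prod(factors)                          # chain-rule product
log_likelihood = sum(math.log(p) for p in factors)    # sum of log factors

# 200 tokens at probability 0.01 each: the true product is 1e-400,
# which is smaller than any positive double, so it collapses to 0.0.
long_factors = [0.01] * 200
underflowed = math.prod(long_factors)
long_log_likelihood = sum(math.log(p) for p in long_factors)
```

Exponentiating `log_likelihood` recovers `product`, so nothing is lost by working in log space.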

Step 4 — maximise log-likelihood = minimise average cross-entropy. Flip the sign and divide by the length. Maximising the log-likelihood is identical to minimising the mean negative log-likelihood, which is precisely the cross-entropy loss of the true next token against the model's predicted distribution:

\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t}).

At each position the target is a one-hot vector on the actual next token, so its cross-entropy against \operatorname{softmax}(z_t) is simply -\log P_\theta(x_t \mid x_{<t}): the surprise at the token that truly came next. Average that surprise over every position, push it downhill with gradient descent, and you have trained a language model. One loss, derived from one rule of probability.
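The claim that one-hot cross-entropy collapses to the negative log-probability of the true token can be checked numerically. The sketch below uses invented predicted distributions for three positions over a three-word vocabulary; `cross_entropy` and `one_hot` are helper names chosen for this example.

```python
import math

def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_w target[w] * log(predicted[w])."""
    return -sum(q * math.log(p) for q, p in zip(target, predicted) if q > 0)

def one_hot(index, size):
    """Vector with 1.0 at the given index and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Hypothetical predicted distributions at T = 3 positions.
predicted = [
    [0.6, 0.3, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.1, 0.7],
]
targets = [0, 1, 2]   # index of the token that actually came next

# Full cross-entropy against the one-hot target ...
per_token = [cross_entropy(one_hot(y, 3), p) for y, p in zip(targets, predicted)]
# ... equals the negative log-probability of the true token.
nll = [-math.log(p[y]) for y, p in zip(targets, predicted)]

loss = sum(per_token) / len(per_token)   # mean negative log-likelihood
```

Only the entry for the true token survives the sum, which is why implementations typically index the log-probabilities directly instead of building one-hot vectors.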

A language model parameterises P_\theta(x_t \mid x_{<t}) over a vocabulary \mathcal{V}. Then:

Chain-rule factorisation. Any sequence factors exactly as P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{<t}), so modelling text reduces to modelling the single next-token factor.
Per-position softmax over the vocabulary. A causally-masked decoder emits a logit vector z_t at each position and P_\theta(x_t = w \mid x_{<t}) = \operatorname{softmax}(z_t)_w, giving all T distributions in one parallel pass.
Cross-entropy / NLL loss. Maximising \log P_\theta equals minimising the average negative log-likelihood \mathcal{L}(\theta) = -\frac{1}{T}\sum_t \log P_\theta(x_t \mid x_{<t}), the mean next-token cross-entropy.

There is no extra magic above the next-token softmax. A language model is the autocomplete on your phone keyboard, scaled up until the “next word” it suggests is good enough to write code, summarise a report, or hold a conversation. To generate, you sample one token from P_\theta(\cdot \mid x_{<t}), append it to the context, and repeat, feeding each guess back in as the next input. This loop-on-your-own-output is what “autoregressive” means, and it is why the same model that scores text can also write it.
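The sample, append, repeat loop might look like the following sketch. The table `COND` and the helper `next_token_probs` are hypothetical stand-ins for a trained model (the toy conditional looks only at the last token), and the fixed seed is there only to make the run repeatable.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the previous token.
# None marks an empty context.
COND = {
    None:  {"the": 0.6, "cat": 0.3, "sat": 0.1},
    "the": {"the": 0.1, "cat": 0.7, "sat": 0.2},
    "cat": {"the": 0.2, "cat": 0.1, "sat": 0.7},
    "sat": {"the": 0.5, "cat": 0.3, "sat": 0.2},
}

def next_token_probs(context):
    """Return the toy next-token distribution for the current context."""
    last = context[-1] if context else None
    return COND[last]

def generate(prompt, steps, seed=0):
    """Autoregressive sampling: draw a token, append it, feed it back in."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(steps):
        probs = next_token_probs(context)
        tokens = list(probs)
        weights = [probs[tok] for tok in tokens]
        next_token = rng.choices(tokens, weights=weights, k=1)[0]
        context.append(next_token)        # the guess becomes part of the input
    return context

sample = generate(["the"], steps=5)
```

The function that scores a sequence and the function that writes one both call the same `next_token_probs`; generation only adds the sampling step and the feedback.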

The surprise of the last decade is how far that single objective stretches: predict the next token well enough, over enough text, and grammar, facts, translation, and a sketch of reasoning all emerge as side effects of getting the autocomplete right. We measure “how good” with perplexity, and we get the free training labels from self-supervised pretraining.

The next-token distribution, drawn

For the context “the cat sat on the” a model spreads its probability mass over the vocabulary: lots on mat and floor, a sliver on banana. Each bar is one P_\theta(x_t = w \mid x_{<t}); the bars are non-negative and sum to 1, because they are a softmax. Slide confidence up to sharpen the distribution toward the most likely word (a peaked, decisive model) or down to flatten it toward a hedging, uncertain one.
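One common way to get this sharpening and flattening is to scale the logits before the softmax. The sketch below assumes the confidence control acts as a multiplier on the logits (the inverse of what is usually called temperature); that mapping is an assumption made for illustration, and the logit values are invented.

```python
import math

def softmax(logits, confidence=1.0):
    """Softmax of the logits after scaling them by the confidence factor."""
    scaled = [confidence * z for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the context "the cat sat on the".
logits = {"mat": 3.0, "floor": 2.5, "banana": -2.0}
words = list(logits)
values = [logits[w] for w in words]

flat = softmax(values, confidence=0.2)    # hedging: mass spread out
base = softmax(values, confidence=1.0)    # unscaled softmax
sharp = softmax(values, confidence=5.0)   # decisive: mass piles onto "mat"
```

In all three settings the bars still sum to 1; only how the mass is shared between the likely and unlikely words changes.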