Two objectives, line by line
Step 1 — masked LM hides a random subset. Pick a random set of positions
\mathcal{M} \subset \{1, \dots, T\} — conventionally about
15\% of the tokens — replace each with a special
\texttt{[MASK]} symbol, and ask the model to restore the
originals. The loss runs only over the masked positions:
\mathcal{L}_{\text{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{t \in \mathcal{M}} \log P_\theta\big(x_t \mid x_{\setminus \mathcal{M}}\big).
The conditioning context x_{\setminus \mathcal{M}} is
everything that was not masked — the words on the left and the
right of each blank.
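Step 1 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a reference recipe: `VOCAB_SIZE`, `MASK_ID`, and the random `logits` standing in for a model's output are all hypothetical, and every selected position is simply replaced by the mask id, as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 50   # hypothetical vocabulary size
MASK_ID = 0       # hypothetical id reserved for [MASK]
T = 20            # sequence length

# Toy token sequence; ids start at 1 so none collides with MASK_ID.
tokens = rng.integers(1, VOCAB_SIZE, size=T)

# Choose about 15% of the positions to hide (at least one).
num_masked = max(1, round(0.15 * T))
masked_positions = rng.choice(T, size=num_masked, replace=False)

# The model would read this corrupted copy, never the originals.
corrupted = tokens.copy()
corrupted[masked_positions] = MASK_ID

# Stand-in for a model's output on `corrupted`: random logits, shape (T, VOCAB_SIZE).
logits = rng.normal(size=(T, VOCAB_SIZE))

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

log_probs = log_softmax(logits)

# Average negative log-likelihood of the original tokens,
# taken over the masked positions only.
mlm_loss = -log_probs[masked_positions, tokens[masked_positions]].mean()
```

The unmasked positions contribute nothing to `mlm_loss`; they serve only as context.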
Step 2 — masked LM uses bidirectional attention. Because a blank may be
filled using its right context, there is no causal mask: every position attends to
every other, future included.
A_{ij} \ \text{may be nonzero for all } j \qquad (\text{no causal mask}).
Filling “the cat \texttt{[MASK]} on the mat” clearly
needs mat, which sits to the right — so two-sided context is exactly the point. This
makes MLM superb for understanding a fixed input (classification, retrieval,
tagging). What it cannot do is generate left to right: every position can already see the
answer, so “predict the next token” is meaningless.
Step 3 — causal LM predicts the next token from the left only. No blanks;
instead, at every position predict the token that follows, using only what came before — the
ordinary language-modeling objective:
\mathcal{L}_{\text{CLM}} = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta\big(x_t \mid x_{<t}\big).
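The next-token loss of Step 3 can be sketched as follows. The names `BOS_ID`, `VOCAB_SIZE`, and `causal_lm_loss` are hypothetical, and the random `logits` merely stand in for one forward pass of a causally masked model. A start-of-sequence token is assumed so that the first real token also has something to condition on.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def causal_lm_loss(logits, targets):
    """Mean negative log-likelihood of targets[t] under logits[t], over all positions."""
    log_probs = log_softmax(logits)
    positions = np.arange(len(targets))
    return -log_probs[positions, targets].mean()

rng = np.random.default_rng(0)
VOCAB_SIZE = 50   # hypothetical vocabulary size
BOS_ID = 0        # hypothetical start-of-sequence id
T = 6

tokens = rng.integers(1, VOCAB_SIZE, size=T)       # x_1 ... x_T
inputs = np.concatenate(([BOS_ID], tokens[:-1]))   # what the model reads
targets = tokens                                   # what it must predict

# Stand-in for the model: row t is meant to hold the logits
# for x_t given everything before it.
logits = rng.normal(size=(T, VOCAB_SIZE))
loss = causal_lm_loss(logits, targets)
```

As a sanity check, uniform logits (all zeros) give a loss of exactly \log(\text{VOCAB\_SIZE}), the cost of guessing blindly.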
Step 4 — causal LM uses a causal mask. Position t
may attend only to positions \le t, so the attention matrix is
lower-triangular —
causal masking again:
A_{ij} = 0 \ \text{ for } j > i \qquad (\text{lower-triangular}).
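One common way to enforce this pattern, sketched here with NumPy, is to set the blocked scores to -\infty before the softmax so that they receive exactly zero weight. The random `scores` are a stand-in for real query-key dot products, and all variable names are illustrative.

```python
import numpy as np

T = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))   # stand-in for raw attention scores

# allowed[i, j] is True when position i may attend to position j (j <= i).
allowed = np.tril(np.ones((T, T), dtype=bool))

# Blocked cells become -inf, so the softmax gives them zero weight.
masked_scores = np.where(allowed, scores, -np.inf)

# Row-wise softmax; every row keeps at least its diagonal entry,
# so the row maximum is always finite.
masked_scores = masked_scores - masked_scores.max(axis=-1, keepdims=True)
weights = np.exp(masked_scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
```

Each row of `weights` still sums to one, but all of the mass sits on or below the diagonal; the first position can attend only to itself.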
Two consequences fall out. First, the model can generate: with the future
sealed off, “predict the next token” is a genuine, leak-free task, so you sample
one token, append it, and repeat. Second — the gift of the mask — every position is a valid
prediction simultaneously, so all T next-token losses train
at once in a single parallel forward pass.
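The sample, append, repeat loop looks like this in a minimal sketch. The "model" is a hypothetical bigram logit table (`bigram_logits`) that looks only at the last token; a real causal LM would condition on the whole prefix, but the loop around it is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 8   # hypothetical vocabulary size

# Hypothetical stand-in for a trained causal LM: a fixed bigram logit table.
bigram_logits = rng.normal(size=(VOCAB_SIZE, VOCAB_SIZE))

def next_token_probs(prefix):
    """Return P(next token | prefix); this toy model uses only the last token."""
    z = bigram_logits[prefix[-1]]
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def generate(prompt, num_new_tokens):
    sequence = list(prompt)
    for _ in range(num_new_tokens):
        probs = next_token_probs(sequence)             # predict from the left only
        token = int(rng.choice(VOCAB_SIZE, p=probs))   # sample one token
        sequence.append(token)                         # append it, then repeat
    return sequence

sample = generate(prompt=[3], num_new_tokens=10)
```

Generation is sequential because each new token depends on the ones sampled before it; it is training, where the whole sequence is already known, that gets the parallel pass.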
Step 5 — why causal LM is the basis of generative LLMs. Line the two up.
MLM conditions on both sides, so it understands but cannot write the next word; CLM
conditions on the left, so it writes the next word, one at a time, indefinitely.
Generation is repeated next-token prediction, and only the causal objective supports
it — which is why every GPT-style large language model is a causal LM:
\text{MLM: } x_t \mid x_{\setminus \mathcal{M}} \ (\text{both sides}) \qquad\Longleftrightarrow\qquad \text{CLM: } x_t \mid x_{<t} \ (\text{left only}).
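Both objectives take their labels for free from the raw text and train with cross-entropy. They differ in the attention mask, and with it in which positions are scored: the masked subset for MLM, every position for CLM.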
-
Masked LM — bidirectional, for understanding. Hide ~15% of tokens and
predict each from both sides
(x_t \mid x_{\setminus \mathcal{M}}) with no causal mask;
ideal for understanding a fixed input, but it cannot generate.
-
Causal LM — left-to-right, for generation. Predict each next token from
the left only (x_t \mid x_{<t}); enables autoregressive
generation and trains all T positions in one parallel pass.
-
The difference is the mask. MLM uses full (bidirectional) attention;
CLM adds the causal mask so A is lower-triangular
(A_{ij} = 0 for j > i). That one
choice is why GPT generates and BERT classifies.
Masked and causal are the famous endpoints, but the mask is a dial, not a switch.
Prefix-LM splits the sequence: a leading prefix is read
bidirectionally (every prefix token sees every other — like an encoder/MLM), while the
continuation is generated causally (each token sees the whole prefix plus the part of the
tail already produced). The attention mask is a block pattern — a full rectangle over the
prefix, lower-triangular over the tail:
A_{ij} \ne 0 \iff \big(j \le \ell\big) \ \text{ or } \ \big(j \le i\big),
where \ell is the prefix length. You get two-sided understanding
of the prompt and left-to-right generation of the answer — the best of both, which
is why encoder-decoder and prefix-LM designs (T5, UL2) sit comfortably between the
extremes. Pure encoder and pure decoder models are just the two ends of this same mask spectrum.
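The prefix-LM rule translates almost directly into code. In this sketch the helper name `prefix_lm_mask` is hypothetical, and indices are 0-based, so the condition j \le \ell becomes `j < prefix_len`.

```python
import numpy as np

def prefix_lm_mask(T, prefix_len):
    """Boolean (T, T) mask: True where position i may attend to position j."""
    i = np.arange(T)[:, None]   # row index: the position doing the attending
    j = np.arange(T)[None, :]   # column index: the position being attended to
    return (j < prefix_len) | (j <= i)

mask = prefix_lm_mask(T=5, prefix_len=2)
print(mask.astype(int))
# [[1 1 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```

Turning the dial to either end recovers the familiar cases: `prefix_len=0` gives the causal lower triangle, and `prefix_len=T` gives the full bidirectional grid.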
The two masks, side by side
Each grid is an attention pattern: a coloured cell at row i,
column j means position i may attend to
position j. On the left, masked LM — the whole
grid is active, so every token reads both directions. On the right, causal
LM — only the lower triangle (j \le i) is active; the upper
triangle (the future) is dark. Same network, same loss; the only difference is which cells
are switched on.
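For readers without the figure, the two grids can be reproduced as text. This is only a sketch: the helper `render` and the symbols `#` (active cell) and `.` (blocked cell) are choices made here for illustration.

```python
import numpy as np

def render(mask):
    """Draw a boolean mask as text: '#' for an active cell, '.' for a blocked one."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)

T = 4
full_mask = np.ones((T, T), dtype=bool)   # masked LM: every cell active
causal_mask = np.tril(full_mask)          # causal LM: lower triangle only

print("masked LM (full attention):")
print(render(full_mask))
print("causal LM (lower triangle):")
print(render(causal_mask))
```

With T = 4 the full grid has 16 active cells and the causal grid has 10, that is T(T+1)/2.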