Masked vs Causal Language Modeling

Self-supervised pretraining said “the text is its own label”. But there is more than one way to hide part of the text and ask the model to recover it, and the choice you make decides what kind of model you build. Two objectives dominate, and they correspond to the two readings of attention you have already met: masked language modeling (MLM), which reads in both directions, and causal language modeling (CLM), which reads left to right only.

They share the cross-entropy loss and the free labels; they differ entirely in which tokens each prediction is allowed to see. That single difference is a mask.

Two objectives, line by line

Step 1 — masked LM hides a random subset. Pick a random set of positions \mathcal{M} \subset \{1, \dots, T\} — conventionally about 15\% of the tokens — replace each with a special \texttt{[MASK]} symbol, and ask the model to restore the originals. The loss runs only over the masked positions:

\mathcal{L}_{\text{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{t \in \mathcal{M}} \log P_\theta\big(x_t \mid x_{\setminus \mathcal{M}}\big).

The conditioning context x_{\setminus \mathcal{M}} is everything that was not masked — the words on the left and the right of each blank.
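To make Step 1 concrete, here is a minimal Python sketch of the masking and the loss. It assumes a toy tokenised sentence and a stand-in `probs` table in place of a real model; the helper names `mask_tokens` and `mlm_loss` are illustrative and not taken from any library. Positions are 0-based in the code, whereas the formula counts from 1.

```python
import math
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Pick a random set M of positions and replace each one with [MASK]."""
    rng = random.Random(seed)
    # Assumption of this sketch: always mask at least one position,
    # so the average over M is well defined even for short inputs.
    n_mask = max(1, round(mask_rate * len(tokens)))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if t in masked_positions else tok
                 for t, tok in enumerate(tokens)]
    return corrupted, masked_positions

def mlm_loss(probs, tokens, masked_positions):
    """Average of -log P(x_t | unmasked context), taken over t in M only."""
    total = 0.0
    for t in masked_positions:
        total += -math.log(probs[t][tokens[t]])
    return total / len(masked_positions)

tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, masked_positions = mask_tokens(tokens, mask_rate=0.15, seed=0)

# Hypothetical model output: probability 0.5 on the true token at every position.
# A real model would compute this from the corrupted sequence.
probs = [{tok: 0.5} for tok in tokens]
loss = mlm_loss(probs, tokens, masked_positions)
```

Only the positions in `masked_positions` contribute to `loss`; the unmasked tokens serve purely as context.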

Step 2 — masked LM uses bidirectional attention. Because a blank may be filled using its right context, there is no causal mask: every position attends to every other, future included.

A_{ij} \ \text{may be nonzero for all } j \qquad (\text{no causal mask}).

Filling “the cat \texttt{[MASK]} on the mat” is far easier with mat in view, and mat sits to the right, so two-sided context is exactly the point. This makes MLM superb for understanding a fixed input (classification, retrieval, tagging). What it cannot do is generate left to right: with full attention, position t already sees token t+1, so “predict the next token” would reduce to copying the answer rather than modeling it.

Step 3 — causal LM predicts the next token from the left only. No blanks; instead, at every position predict the token that follows, using only what came before — the ordinary language-modeling objective:

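\mathcal{L}_{\text{CLM}} = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta\big(x_t \mid x_{<t}\big).

The conditioning context x_{<t} = (x_1, \dots, x_{t-1}) is everything to the left of position t; nothing to the right is visible.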

Step 4 — causal LM uses a causal mask. Position t may attend only to positions \le t, so the attention matrix is lower-triangular — causal masking again:

A_{ij} = 0 \ \text{ for } j > i \qquad (\text{lower-triangular}).
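As a rough illustration of Step 4, the sketch below builds the lower-triangular pattern and applies it inside a row-wise softmax. The score values are made up, and `causal_mask` and `masked_softmax` are hypothetical helper names; indices are 0-based.

```python
import math

def causal_mask(T):
    """allowed[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(T)] for i in range(T)]

def masked_softmax(scores, allowed):
    """Row-wise softmax in which disallowed entries get exactly zero weight."""
    weights = []
    for row, ok in zip(scores, allowed):
        # Subtract the max over allowed entries for numerical stability.
        m = max(s for s, a in zip(row, ok) if a)
        exps = [math.exp(s - m) if a else 0.0 for s, a in zip(row, ok)]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

T = 4
# Hypothetical raw attention scores (query i against key j); any numbers work.
scores = [[0.1 * (i + 1) * (j + 1) for j in range(T)] for i in range(T)]
A = masked_softmax(scores, causal_mask(T))
```

Because masked entries receive exactly zero weight, changing any score in the upper triangle leaves `A` unchanged: no information from the future can reach position i.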

Two consequences fall out. First, the model can generate: with the future sealed off, “predict the next token” is a genuine, leak-free task, so you sample one token, append it, and repeat. Second — the gift of the mask — every position is a valid prediction simultaneously, so all T next-token losses train at once in a single parallel forward pass.
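Both consequences can be sketched with a toy stand-in for the model. The bigram table below is purely hypothetical: it conditions on the previous token only, which is a very weak special case of conditioning on x_{<t}. The names `next_token_dist`, `per_position_losses` and `generate` are illustrative. In a real Transformer the causal mask lets all T predictions come out of one forward pass; this toy computes them in a plain loop.

```python
import math
import random

BOS, EOS = "<s>", "</s>"

# Hypothetical toy "model": next-token probabilities given the previous token.
BIGRAM = {
    BOS: {"the": 1.0},
    "the": {"cat": 0.5, "mat": 0.5},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {EOS: 1.0},
}

def next_token_dist(prefix):
    """Distribution over the next token, using only tokens to the left."""
    prev = prefix[-1] if prefix else BOS
    return BIGRAM[prev]

def per_position_losses(tokens):
    """Training view: one loss per position, each conditioned on the true prefix."""
    return [-math.log(next_token_dist(tokens[:t])[tokens[t]])
            for t in range(len(tokens))]

def generate(max_new_tokens=10, seed=0):
    """Generation view: sample one token, append it, repeat."""
    rng = random.Random(seed)
    out = []
    for _ in range(max_new_tokens):
        dist = next_token_dist(out)
        choices, weights = zip(*dist.items())
        tok = rng.choices(choices, weights=weights, k=1)[0]
        if tok == EOS:
            break
        out.append(tok)
    return out

tokens = ["the", "cat", "sat", "on", "the", "mat"]
losses = per_position_losses(tokens)   # T losses, one per position
clm_loss = sum(losses) / len(losses)   # the average in L_CLM
sample = generate()
```

Note that the same function, `next_token_dist`, serves both training and generation. That is possible only because it never looks to the right of the position it is predicting.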

Step 5 — why causal LM is the basis of generative LLMs. Line the two up. MLM conditions on both sides, so it understands but cannot write the next word; CLM conditions on the left, so it writes the next word, one at a time, indefinitely. Generation is repeated next-token prediction, and only the causal objective supports it — which is why every GPT-style large language model is a causal LM:

\text{MLM: } x_t \mid x_{\setminus \mathcal{M}} \ (\text{both sides}) \qquad\Longleftrightarrow\qquad \text{CLM: } x_t \mid x_{<t} \ (\text{left only}).

Both objectives use free labels and cross-entropy; they differ only in the mask.
  • Masked LM: bidirectional, for understanding. Hide ~15% of tokens and predict each from both sides (x_t \mid x_{\setminus \mathcal{M}}) with no causal mask; ideal for understanding a fixed input, but it cannot generate.
  • Causal LM: left-to-right, for generation. Predict each next token from the left only (x_t \mid x_{<t}); enables autoregressive generation and trains all T positions in one parallel pass.
  • The difference is the mask. MLM uses full (bidirectional) attention; CLM adds the causal mask so A is lower-triangular (A_{ij} = 0 for j > i). That one choice is why GPT generates and BERT classifies.

Masked and causal are the famous endpoints, but the mask is a dial, not a switch. Prefix-LM splits the sequence: a leading prefix is read bidirectionally (every prefix token sees every other — like an encoder/MLM), while the continuation is generated causally (each token sees the whole prefix plus the part of the tail already produced). The attention mask is a block pattern — a full rectangle over the prefix, lower-triangular over the tail:

A_{ij} \ne 0 \iff \big(j \le \ell\big) \ \text{ or } \ \big(j \le i\big),

where \ell is the prefix length. You get two-sided understanding of the prompt and left-to-right generation of the answer — the best of both, which is why encoder-decoder and prefix-LM designs (T5, UL2) sit comfortably between the extremes. Pure encoder vs decoder models are just the two ends of this same mask spectrum.
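The dial can be sketched in a few lines of Python. The function below is an illustrative helper, not a library API: it implements the prefix-LM rule with 0-based indices, so the condition j \le \ell becomes `j < prefix_len`. Setting `prefix_len=0` gives the causal mask and `prefix_len=T` gives full bidirectional attention.

```python
def attention_mask(T, prefix_len=0):
    """allowed[i][j] is True iff j lies in the prefix or j <= i (0-based)."""
    return [[(j < prefix_len) or (j <= i) for j in range(T)]
            for i in range(T)]

def render(mask):
    """Draw the mask as text: '#' for an active cell, '.' for a blocked one."""
    return "\n".join("".join("#" if cell else "." for cell in row)
                     for row in mask)

causal = attention_mask(5, prefix_len=0)   # lower triangle only
prefix = attention_mask(5, prefix_len=3)   # full block over the prefix, causal tail
full = attention_mask(5, prefix_len=5)     # every cell active

print(render(prefix))
# ###..
# ###..
# ###..
# ####.
# #####
```

The first three columns are fully active (every position reads the whole prefix), while the last two columns follow the causal staircase.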

The two masks, side by side

Each grid is an attention pattern: a coloured cell at row i, column j means position i may attend to position j. On the left, masked LM — the whole grid is active, so every token reads both directions. On the right, causal LM — only the lower triangle (j \le i) is active; the upper triangle (the future) is dark. Same network, same loss; the only difference is which cells are switched on.