Blocking the future, line by line
Recall scaled dot-product attention for a sequence of length n: from
queries Q and keys K we form an n \times n score
matrix and apply a softmax to each row to get attention weights. We intervene at exactly one point: the
scores, just before the softmax.
Step 1 — start from the raw scores. Entry
S_{ij} is how much query position i
wants to attend to key position j:
S_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}.
Position i reading position j with
j > i means a present token consulting a future token —
exactly what we must stop.
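As a minimal sketch of Step 1, the snippet below computes the raw score matrix in plain Python. The helper name `attention_scores` and the toy values of Q and K are assumptions made up for illustration, and the code indexes positions from 0 where the text counts from 1.

```python
import math

def attention_scores(Q, K):
    """Return S with S[i][j] = (q_i . k_j) / sqrt(d_k)."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    return [[sum(q * k for q, k in zip(q_i, k_j)) / scale for k_j in K]
            for q_i in Q]

# Toy inputs (illustrative only): n = 3 positions, d_k = 4.
Q = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]
K = [[1.0, 1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [2.0, 0.0, 2.0, 0.0]]

S = attention_scores(Q, K)
for row in S:
    print(row)
```

In this toy example the entry `S[0][2]` sits above the diagonal: it is the first position scoring the last one, which is the kind of look-ahead the mask has to remove.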
Step 2 — build the causal mask. Define an
n \times n matrix M that is
-\infty strictly above the diagonal (the future) and
0 on and below it (the present and past):
M_{ij} = \begin{cases} 0 & j \le i \quad(\text{present or past}) \\ -\infty & j > i \quad(\text{future}). \end{cases}
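The definition of M translates almost word for word into code. This is a small sketch in plain Python; the function name `causal_mask` is hypothetical, and indices start at 0 rather than 1.

```python
def causal_mask(n):
    """n x n mask: 0.0 where j <= i (present or past), -inf where j > i (future)."""
    neg_inf = float("-inf")
    return [[0.0 if j <= i else neg_inf for j in range(n)] for i in range(n)]

M = causal_mask(4)
for row in M:
    print(row)
```

Each printed row has one more zero than the row above it, so the zeros fill the diagonal and everything below it.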
Step 3 — add the mask to the scores. Adding
0 leaves the allowed entries untouched; adding
-\infty drives every future entry to -\infty:
(S + M)_{ij} = \begin{cases} S_{ij} & j \le i \\ -\infty & j > i. \end{cases}
Step 4 — softmax kills the future exactly. Apply the
softmax
along each row. Because e^{-\infty} = 0, every masked entry
contributes nothing to the normaliser and receives weight exactly zero — not small,
not approximately, but 0:
A_{ij} = \operatorname{softmax}_j (S + M)_{ij} = \frac{e^{\,S_{ij}}\,\mathbf{1}[\,j \le i\,]}{\sum_{k \le i} e^{\,S_{ik}}}, \qquad A_{ij} = 0 \ \text{ for } j > i.
Step 5 — read off the shape. Every weight above the diagonal is zero, so the
attention matrix A is lower-triangular. Each row
i is a clean probability distribution over only positions
1, \dots, i:
\sum_{j=1}^{i} A_{ij} = 1, \qquad A_{ij} = 0 \text{ for all } j > i.
Position i attends to itself and everything before it, and to
nothing after. The future is sealed off.
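Steps 2 to 5 can be checked numerically with a short sketch. The function name `causal_attention_weights` and the score values are assumptions chosen for illustration; subtracting the row maximum is a standard numerical-stability step that does not change the softmax.

```python
import math

NEG_INF = float("-inf")

def causal_attention_weights(S):
    """Row-wise softmax of S + M, where M is the causal mask."""
    n = len(S)
    A = []
    for i in range(n):
        # Steps 2-3: add 0 on and below the diagonal, -inf above it.
        masked = [S[i][j] + (0.0 if j <= i else NEG_INF) for j in range(n)]
        # Subtracting the row maximum leaves the softmax unchanged and
        # avoids overflow; the diagonal entry is finite, so the max is too.
        row_max = max(masked)
        exps = [math.exp(x - row_max) for x in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        A.append([e / total for e in exps])
    return A

# Arbitrary illustrative scores for n = 3 positions.
S = [[0.2, 1.5, -0.3],
     [0.7, 0.1, 2.0],
     [-1.0, 0.4, 0.9]]

A = causal_attention_weights(S)
for row in A:
    print([round(a, 3) for a in row])
```

The first row always comes out as all weight on the first position, since that position has no past to attend to, and every entry above the diagonal is exactly `0.0` rather than a small positive number.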
Why this is the whole point of parallel training
Here is the payoff that makes transformers so fast to train. A naïve autoregressive model
would have to process the sequence token by token even during training, one slow step at a time. But with
the mask in place, position i's prediction already depends
only on positions 1, \dots, i, so we can feed the model the
entire target sequence at once and compute all
n next-token predictions in a single forward pass. The mask
guarantees that prediction i never cheats by seeing token
i+1, even though that token is sitting right there in the same input sequence.
Left-to-right structure is preserved, and full GPU parallelism is unlocked.
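The no-leakage claim can be tested directly: change the last token and check that the outputs at every earlier position stay the same. The sketch below is a single causal attention layer in plain Python under simplifying assumptions that are not from the text: the token vectors are used directly as queries, keys and values (no learned projections), and the name `causal_attention` and the toy vectors are made up for illustration.

```python
import math

def causal_attention(X):
    """One causal self-attention layer with Q = K = V = X (assumption)."""
    n, d = len(X), len(X[0])
    out = []
    for i in range(n):
        # Scores for row i, with future positions (j > i) set to -inf.
        scores = [
            sum(a * b for a, b in zip(X[i], X[j])) / math.sqrt(d)
            if j <= i else float("-inf")
            for j in range(n)
        ]
        row_max = max(scores)
        exps = [math.exp(s - row_max) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of value vectors; future weights are exactly 0.0.
        out.append([sum(weights[j] * X[j][c] for j in range(n)) for c in range(d)])
    return out

X_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
X_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 3.0]]  # only the last token differs

out_a = causal_attention(X_a)
out_b = causal_attention(X_b)

print(out_a[:3] == out_b[:3])  # earlier positions are unaffected
print(out_a[3] == out_b[3])    # the last position does change
```

All four rows are computed from one pass over the full sequence, yet the first three are blind to the final token, which is the property that lets every next-token prediction be trained at once.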
Adding the mask M (with
M_{ij} = -\infty for j > i and
0 otherwise) to the attention scores before the softmax has three
consequences:
- The future is set to -\infty. Every score S_{ij} with j > i becomes -\infty, while present/past scores are unchanged (+0).
- \operatorname{softmax}(-\infty) = 0 \Rightarrow lower-triangular. Since e^{-\infty} = 0, every future weight is exactly zero, so A is lower-triangular and each row i is a distribution over positions 1, \dots, i only.
- Parallel training of all next-token predictions. Because prediction i depends only on positions \le i, the whole sequence's n next-token predictions can be computed in one forward pass, with no leakage from future tokens: fully parallel training that still respects left-to-right order.
The mask is a design knob, and which way you set it decides what kind of model you are
building.
No mask (bidirectional). Every position attends to every other, future
included. The model sees the whole context at once, which is ideal for
understanding a fixed input — filling in a blanked-out word needs the words on
both sides. This is the encoder reading of a sentence. What it cannot do is
generate: if position i may look at
i+1, "predict the next token" is meaningless because the answer
is already visible.
Causal mask (left-to-right). Each position sees only itself and the past,
so "predict the next token" is a genuine, leak-free task at every position simultaneously.
This is the decoder reading, and the foundation of autoregressive
language models. You trade away two-sided context for the ability to write the
future one token at a time, which is exactly the trade a generator wants.
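To see the knob in action, the sketch below computes attention weights from the same scores with the mask off and on. The function name `attention_weights`, its `causal` flag and the all-zero scores are assumptions picked so the result is easy to read by eye.

```python
import math

def attention_weights(S, causal):
    """Row-wise softmax of S, optionally with the causal mask added first."""
    n = len(S)
    A = []
    for i in range(n):
        masked = [
            S[i][j] if (not causal or j <= i) else float("-inf")
            for j in range(n)
        ]
        row_max = max(masked)
        exps = [math.exp(x - row_max) for x in masked]
        total = sum(exps)
        A.append([e / total for e in exps])
    return A

# Equal scores everywhere, so any difference comes from the mask alone.
S = [[0.0] * 3 for _ in range(3)]

bidirectional = attention_weights(S, causal=False)
left_to_right = attention_weights(S, causal=True)

print(bidirectional)
print(left_to_right)
```

With no mask every row spreads its weight evenly over all three positions; with the causal mask, row i spreads it evenly over the first i + 1 positions only, so the first row puts everything on itself.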