Self-attention has a beautiful symmetry — and a fatal blind spot. It treats its input as a
set, not a sequence. Permute the tokens and the outputs just permute the same
way: attention has no built-in notion of order. But order is meaning —
"dog bites man" is not "man bites dog". So we must inject position information,
and the elegant trick is to simply add it to the token
Make the claim precise. Attention builds each output as a weighted average of values, where the
weights come from query–key matches — and a sum doesn't care about order. If
Nothing inside the mechanism distinguishes "the token at position 1" from "the token at position 5". Left to itself, the transformer would read a sentence as a bag of words. We have to tell it where each token sits.
Give every position
Step 1 — define a frequency per dimension pair. Pair up the
As
Step 2 — fill the even coordinates with sines. Each even dimension is a sine of position at that pair's frequency:
Step 3 — fill the odd coordinates with cosines. Each odd dimension is the cosine partner at the same frequency:
Step 4 — read the position off as a fingerprint. Stand at one position and read
its vector across dimensions: the fast sinusoids flip quickly, the slow ones barely move — exactly
like the bits of a binary counter, where the low bit toggles every step and the high bit toggles
rarely. Each
Step 5 — add it to the embedding. Finally, just sum: the input to attention for
the token at position
No new parameters, no architectural change — the order signal rides along inside the same vectors the model already processes.
There is a hidden gift in choosing sine and cosine. By the angle-addition formulae, the encoding
at
Because the shift-by-
Adding a fixed code is not the only way. A popular modern alternative, rotary position
embeddings, takes the rotation we just saw and applies it directly to the query and
key vectors instead of summing a code into the embedding. Each pair of coordinates is
rotated by an angle proportional to the token's position, so that the dot product
between a query at position
Each curve is one dimension of the positional code, plotted against position