Positional Encoding

Self-attention has a beautiful symmetry — and a fatal blind spot. It treats its input as a set, not a sequence. Permute the tokens and the outputs just permute the same way: attention has no built-in notion of order. But order is meaning — "dog bites man" is not "man bites dog". So we must inject position information, and the elegant trick is to simply add it to the token embeddings before attention ever runs.

Attention is permutation-equivariant

Make the claim precise. Attention builds each output as a weighted average of values, where the weights come from query–key matches — and a sum doesn't care about order. If P is a permutation of the tokens, then attention applied to the shuffled input is the shuffled attention of the original:

\operatorname{Attention}(P X) = P\,\operatorname{Attention}(X).

Nothing inside the mechanism distinguishes "the token at position 1" from "the token at position 5". Left to itself, the transformer would read a sentence as a bag of words. We have to tell it where each token sits.
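The identity above can be checked numerically. The following is a minimal sketch, assuming a single head with identity projections (Q = K = V = X) and toy two-dimensional tokens; the function names are illustrative, not taken from any library. Learned projections act on each token separately, so the same check should pass with them too.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X):
    """Single-head self-attention with Q = K = V = X (an assumption for brevity)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # three toy token vectors
perm = [2, 0, 1]                           # row r of PX is row perm[r] of X

shuffled_then_attended = attention([X[p] for p in perm])   # Attention(PX)
attended_then_shuffled = [attention(X)[p] for p in perm]   # P Attention(X)

max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(shuffled_then_attended, attended_then_shuffled)
    for a, b in zip(row_a, row_b)
)
print(max_gap)  # zero up to floating-point rounding
```

Both routes give the same rows in the same shuffled order, which is why attention alone cannot recover where a token came from.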

The sinusoidal encoding, line by line

Give every position \textit{pos} = 0, 1, 2, \dots its own vector PE(\textit{pos}) \in \mathbb{R}^d — same dimension as the embeddings — built from sines and cosines of geometrically spaced frequencies.

Step 1 — define a frequency per dimension pair. Pair up the d coordinates as (2i, 2i+1) for i = 0, \dots, d/2 - 1, and give pair i an angular rate

\omega_i = \frac{1}{10000^{\,2i/d}}.

As i climbs from 0 to d/2-1, \omega_i shrinks from 1 toward (but never quite reaching) 1/10000. The wavelengths 2\pi/\omega_i therefore sweep from short (about 6.3 positions per cycle at i = 0) to very long (approaching 2\pi \cdot 10000 \approx 62{,}832 positions per cycle for large d).
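To make the frequency schedule concrete, here is a small sketch that tabulates \omega_i and the wavelength 2\pi/\omega_i for each pair. The dimension d = 8 is a toy value chosen only for illustration, and the helper name frequencies is hypothetical.

```python
import math

def frequencies(d):
    """Angular rate omega_i = 1 / 10000**(2i/d) for each pair i = 0 .. d/2 - 1."""
    return [1.0 / 10000 ** (2 * i / d) for i in range(d // 2)]

d = 8  # toy model dimension (assumed even)
freqs = frequencies(d)
for i, omega in enumerate(freqs):
    wavelength = 2 * math.pi / omega  # positions per full cycle
    print(i, round(omega, 6), round(wavelength, 2))
```

With d = 8 the four wavelengths come out near 6.28, 62.8, 628 and 6283 positions, so each pair is ten times slower than the one before it. Larger d spaces the same overall range more finely.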

Step 2 — fill the even coordinates with sines. Each even dimension is a sine of position at that pair's frequency:

PE(\textit{pos},\, 2i) = \sin\!\left(\frac{\textit{pos}}{10000^{\,2i/d}}\right).

Step 3 — fill the odd coordinates with cosines. Each odd dimension is the cosine partner at the same frequency:

PE(\textit{pos},\, 2i+1) = \cos\!\left(\frac{\textit{pos}}{10000^{\,2i/d}}\right).

Step 4 — read the position off as a fingerprint. Stand at one position and read its vector across dimensions: the fast sinusoids flip quickly, the slow ones barely move — exactly like the bits of a binary counter, where the low bit toggles every step and the high bit toggles rarely. Each \textit{pos} gets a unique pattern of sine/cosine values, a continuous "binary-clock" fingerprint of where it sits.
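Steps 1 to 4 can be sketched in a few lines. This is an illustrative implementation under the formulas above, with a toy dimension d = 8; positional_encoding is a name chosen here, not a library function.

```python
import math

def positional_encoding(pos, d):
    """PE(pos) in R^d: sine in even slots, cosine in odd slots, one frequency per pair."""
    pe = [0.0] * d
    for i in range(d // 2):
        angle = pos / 10000 ** (2 * i / d)
        pe[2 * i] = math.sin(angle)      # even coordinate 2i
        pe[2 * i + 1] = math.cos(angle)  # odd coordinate 2i + 1
    return pe

d = 8  # toy dimension (assumed even)
for pos in range(4):
    # Leading coordinates change quickly with pos, trailing ones barely move.
    print(pos, [round(v, 3) for v in positional_encoding(pos, d)])

# Collect rounded fingerprints for the first 100 positions and count distinct ones.
fingerprints = {
    tuple(round(v, 6) for v in positional_encoding(pos, d)) for pos in range(100)
}
print(len(fingerprints))
```

Position 0 is always the pattern 0, 1, 0, 1, ... because sin(0) = 0 and cos(0) = 1. If the final count equals the number of positions tested, every position in that range received its own fingerprint.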

Step 5 — add it to the embedding. Finally, just sum: the input to attention for the token at position \textit{pos} with embedding x_{\textit{pos}} is

\tilde{x}_{\textit{pos}} = x_{\textit{pos}} + PE(\textit{pos}).

No new parameters, no architectural change — the order signal rides along inside the same vectors the model already processes.
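As a sketch of Step 5, the snippet below adds the positional code to a short sequence of embeddings. The embedding values are made up for illustration, and add_positions is a hypothetical helper; real models would do this on tensors rather than Python lists.

```python
import math

def positional_encoding(pos, d):
    """Sinusoidal code: sine in even slots, cosine in odd slots."""
    pe = [0.0] * d
    for i in range(d // 2):
        angle = pos / 10000 ** (2 * i / d)
        pe[2 * i] = math.sin(angle)
        pe[2 * i + 1] = math.cos(angle)
    return pe

def add_positions(embeddings):
    """Return x_pos + PE(pos) for every row of a sequence of embeddings."""
    d = len(embeddings[0])
    return [
        [x + p for x, p in zip(vec, positional_encoding(pos, d))]
        for pos, vec in enumerate(embeddings)
    ]

# The same token embedding repeated at positions 0, 1 and 2 (made-up values).
same_token = [0.5, -0.5, 0.25, 0.0]
inputs = add_positions([same_token, same_token, same_token])
for row in inputs:
    print([round(v, 3) for v in row])
```

Three copies of one token go in, three different vectors come out: the only thing distinguishing them is the added position code, which is exactly the signal attention was missing.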

Relative position comes for free

There is a hidden gift in choosing sine and cosine. By the angle-addition formulae, the encoding at \textit{pos}+k is a fixed linear function of the encoding at \textit{pos} — a rotation by the fixed angle \omega_i k in each pair:

\begin{pmatrix} \sin\omega_i(\textit{pos}+k) \\ \cos\omega_i(\textit{pos}+k) \end{pmatrix} = \begin{pmatrix} \cos\omega_i k & \sin\omega_i k \\ -\sin\omega_i k & \cos\omega_i k \end{pmatrix}\!\begin{pmatrix} \sin\omega_i\,\textit{pos} \\ \cos\omega_i\,\textit{pos} \end{pmatrix}.

Because the shift-by-k map doesn't depend on \textit{pos}, the model can learn to attend by relative offset ("the token three back") with a single linear operation — exactly the kind of pattern a linear projection inside attention can capture.
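The rotation identity can be verified numerically for a single pair. In this sketch the rate omega = 0.1 and offset k = 3 are arbitrary choices, and the helper names pair and shift are illustrative only.

```python
import math

def pair(pos, omega):
    """The (sin, cos) pair of the encoding at one frequency."""
    return (math.sin(omega * pos), math.cos(omega * pos))

def shift(vec, omega, k):
    """Apply the 2x2 matrix [[cos wk, sin wk], [-sin wk, cos wk]] to a (sin, cos) pair."""
    s, c = vec
    ck, sk = math.cos(omega * k), math.sin(omega * k)
    return (ck * s + sk * c, -sk * s + ck * c)

omega, k = 0.1, 3
errors = []
for pos in (0, 7, 250):
    predicted = shift(pair(pos, omega), omega, k)  # matrix applied to PE(pos)
    actual = pair(pos + k, omega)                  # PE(pos + k) computed directly
    errors.append(max(abs(p - a) for p, a in zip(predicted, actual)))

print(errors)  # each entry is zero up to floating-point rounding
```

The same matrix is used at every starting position, which is the point: the shift depends on k and the frequency, never on \textit{pos}.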

Self-attention needs an external order signal, supplied by adding a positional code to each embedding:

Permutation equivariance. Attention treats its input as a set: \operatorname{Attention}(PX) = P\,\operatorname{Attention}(X), so with no positional signal it cannot tell token order.
Sinusoidal formula. PE(\textit{pos}, 2i) = \sin(\textit{pos}/10000^{2i/d}) and PE(\textit{pos}, 2i+1) = \cos(\textit{pos}/10000^{2i/d}) — each dimension a sinusoid of a different wavelength, fast to slow, giving every position a unique fingerprint.
Added, not concatenated. The code is summed into the embedding, \tilde{x}_{\textit{pos}} = x_{\textit{pos}} + PE(\textit{pos}), with no extra parameters.
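Learned alternative. One may instead learn a position embedding per slot. Sinusoids, by contrast, are defined for every position, so they can be evaluated at lengths unseen in training, and a shift by k acts on them as a fixed linear map (a rotation in each pair) that does not depend on \textit{pos}.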

Adding a fixed code is not the only way. A popular modern alternative, rotary position embeddings (RoPE), takes the rotation we just saw and applies it directly to the query and key vectors instead of summing a code into the embedding. Each pair of coordinates is rotated by an angle proportional to the token's position, so that the dot product between a query at position m and a key at position n depends only on the relative offset m - n: relative position is baked structurally into attention rather than added on the side. We'll meet RoPE properly later; for now, notice the seed is already here in the rotation matrix from the relative-position section above.
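The relative-offset property of rotary embeddings can be seen on a single coordinate pair. This is only a toy sketch, not a full RoPE implementation: it assumes one two-dimensional pair, a single rate omega = 0.1, and made-up query and key values, and the function names are hypothetical.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by the given angle."""
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def rope_score(q, k, m, n, omega=0.1):
    """Dot product of a query rotated for position m and a key rotated for position n."""
    qm = rotate(q, omega * m)
    kn = rotate(k, omega * n)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k = (1.0, 2.0), (0.5, -1.0)  # made-up query and key pair

near = rope_score(q, k, m=5, n=2)       # offset m - n = 3, early in the sequence
far = rope_score(q, k, m=105, n=102)    # same offset, 100 positions later
other = rope_score(q, k, m=5, n=1)      # different offset m - n = 4

print(near, far, other)
```

The first two scores agree up to rounding because both pairs sit three positions apart, while the third differs because the offset changed. Absolute position has dropped out of the score entirely.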

See the wavelengths fan out

Each curve is one dimension of the positional code, plotted against position \textit{pos}: a fast sinusoid (low i, short wavelength) and progressively slower ones as i grows. Reading a single vertical slice (one position) down through all the curves gives that position its unique binary-clock fingerprint. Stretching the position axis reveals the long wavelengths of the slow dimensions.