Rotary Position Embeddings (RoPE)

The original transformer added a sinusoidal code to each embedding before attention ran. It works, but it bolts position on from the outside — and the model only ever sees an absolute index. Modern models (Llama, most of the recent crop) use a sharper idea: instead of adding a position vector, rotate each query and key by an angle proportional to its position. The relative geometry of two tokens then falls out of attention for free.

Rotate, don't add — line by line

Work in one pair of coordinates at a time. Take a query vector q \in \mathbb{R}^d and group its coordinates into 2-D pairs (q_0, q_1), (q_2, q_3), \dots — exactly the pairing the sinusoidal code used, now treated as little plane vectors.

Step 1 — pick an angle that grows with position. Pair i gets a fixed base frequency \theta_i = 10000^{-2i/d} (fast pairs spin quickly, slow pairs barely move — same geometric spread as before). A token at position m turns that pair by the angle

m\,\theta_i.

Position scales the rotation: token 0 is unrotated, and token m is rotated m times as far as token 1.
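As a minimal sketch of Step 1, the snippet below computes the per-pair frequencies and the angle a given position assigns to each pair. The function names rope_frequencies and rope_angles and the toy dimension d = 8 are illustrative choices, not taken from any particular library.

```python
def rope_frequencies(d, base=10000.0):
    """Base frequency theta_i = base**(-2i/d) for each of the d/2 pairs."""
    if d % 2 != 0:
        raise ValueError("d must be even so coordinates can be paired")
    return [base ** (-2.0 * i / d) for i in range(d // 2)]


def rope_angles(m, d, base=10000.0):
    """Angle m * theta_i that position m assigns to each pair i."""
    return [m * theta for theta in rope_frequencies(d, base)]


# Toy dimension, chosen only so the numbers are easy to read.
thetas = rope_frequencies(8)
angles_at_3 = rope_angles(3, 8)
```

With d = 8 the frequencies come out close to 1, 0.1, 0.01 and 0.001, so at position 3 the fastest pair has turned by about 3 radians while the slowest has turned by about 0.003.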

Step 2 — apply a 2-D rotation to the pair. Rotating the pair at position m uses the ordinary rotation matrix

R_m^{(i)} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}, \qquad \tilde{q}^{(i)} = R_m^{(i)}\,q^{(i)}.

Stacking these 2\times 2 blocks down the diagonal gives one big block-diagonal rotation R_m acting on the whole vector. The key at position n is rotated the same way: \tilde{k} = R_n\,k. Crucially the length of every pair is untouched — a rotation only turns.
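The block-diagonal rotation never needs to be built as a matrix: each pair can be rotated on its own. The sketch below does that in plain Python, assuming the adjacent pairing (x_0, x_1), (x_2, x_3), ... used in this section; rope_rotate and pair_norms are hypothetical helper names, and the vector is a made-up example.

```python
import math


def rope_rotate(x, m, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) of x by the angle m * theta_i."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(m * theta), math.sin(m * theta)
        a, b = x[2 * i], x[2 * i + 1]
        # Apply the 2x2 rotation block for pair i to (a, b).
        out.extend([c * a - s * b, s * a + c * b])
    return out


def pair_norms(x):
    """Euclidean length of each 2-D pair."""
    return [math.hypot(x[2 * i], x[2 * i + 1]) for i in range(len(x) // 2)]


q = [1.0, 0.0, 0.5, -2.0]   # made-up query vector, d = 4
q_rot = rope_rotate(q, m=5)
```

Comparing pair_norms(q) with pair_norms(q_rot) should show matching lengths up to floating-point error, which is the "a rotation only turns" property. Some implementations pair coordinate i with coordinate i + d/2 instead of using adjacent coordinates; that changes the layout but not the argument.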

Step 3 — form the attention score. Attention compares query and key by a dot product. With RoPE the score between a query at m and a key at n is

\langle \tilde{q}, \tilde{k}\rangle = (R_m q)^{\top}(R_n k) = q^{\top} R_m^{\top} R_n\, k.

Step 4 — rotations compose. A rotation's inverse is its transpose, and turning by -m\theta_i then +n\theta_i is a single turn by (n-m)\theta_i:

R_m^{\top} R_n = R_{-m}\,R_n = R_{\,n-m}.

The two absolute positions m and n have collapsed into the single relative offset n-m.
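For readers who want to see the composition rule worked out, here is the product for a single pair, written with the shorthand \alpha = m\theta_i and \beta = n\theta_i (these two symbols are introduced only for this expansion). It uses nothing beyond the angle-difference identities for sine and cosine.

```latex
\[
R_m^{(i)\top} R_n^{(i)}
= \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}
  \begin{pmatrix} \cos\beta & -\sin\beta \\ \sin\beta & \cos\beta \end{pmatrix}
= \begin{pmatrix}
    \cos\alpha\cos\beta + \sin\alpha\sin\beta & -(\sin\beta\cos\alpha - \cos\beta\sin\alpha) \\
    \sin\beta\cos\alpha - \cos\beta\sin\alpha & \cos\alpha\cos\beta + \sin\alpha\sin\beta
  \end{pmatrix}
= \begin{pmatrix} \cos(\beta-\alpha) & -\sin(\beta-\alpha) \\ \sin(\beta-\alpha) & \cos(\beta-\alpha) \end{pmatrix}
= R_{n-m}^{(i)}.
\]
```

Since \beta - \alpha = (n-m)\theta_i, every block depends on the two positions only through their difference, and the same holds for the full block-diagonal matrix.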

Step 5 — read off the punchline. Substituting back,

\langle \tilde{q}, \tilde{k}\rangle = q^{\top} R_{\,n-m}\, k = g(q, k,\; n-m).

The attention score depends on q, k, and only the gap n-m — never on where the pair sits in absolute terms. Slide both tokens ten positions to the right and their score is identical. The model sees relative position, baked structurally into the dot product rather than added on the side.
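The shift-invariance claim is easy to check numerically. The sketch below rotates a made-up query and key, takes their dot product, and compares the score before and after sliding both tokens ten positions to the right. The helper names (rope_rotate, rope_score) and the vectors are illustrative assumptions, not part of any library.

```python
import math


def rope_rotate(x, m, base=10000.0):
    """Rotate each adjacent pair of x by m * theta_i, theta_i = base**(-2i/d)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(m * theta), math.sin(m * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([c * a - s * b, s * a + c * b])
    return out


def rope_score(q, k, m, n):
    """Dot product of the query rotated to position m and the key rotated to n."""
    return sum(a * b for a, b in zip(rope_rotate(q, m), rope_rotate(k, n)))


q = [0.3, -1.2, 0.7, 0.1]   # made-up query
k = [1.1, 0.4, -0.5, 0.9]   # made-up key

score_a = rope_score(q, k, m=2, n=7)     # gap n - m = 5
score_b = rope_score(q, k, m=12, n=17)   # both shifted by 10, gap still 5
score_c = rope_score(q, k, m=2, n=8)     # gap changed to 6
```

score_a and score_b should agree up to floating-point error, while score_c differs because the gap changed.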

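Step 6 — why it extrapolates. Because only the offset matters and each pair's rotation is a smooth periodic function of that offset, a model trained on short contexts has already seen every small relative angle, and its fast pairs have wrapped through their full range many times. A longer sequence reuses those same rotations for nearby tokens. What is new are the offsets beyond the training length, and they matter mainly for the slow pairs, whose angles move past anything seen in training. So RoPE tends to generalise to longer contexts more gracefully than a learned per-slot table or a fixed additive code, although large extensions still need help.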

RoPE encodes position by rotating queries and keys rather than adding a code:

Rotate by m\theta. Pair the coordinates; a token at position m rotates pair i by m\theta_i with \theta_i = 10000^{-2i/d}, i.e. \tilde{q} = R_m q, \tilde{k} = R_n k.
Score depends only on n-m. Since R_m^{\top} R_n = R_{\,n-m}, the dot product \langle R_m q, R_n k\rangle = q^{\top} R_{\,n-m}\, k is a function of the relative offset alone.
Better length extrapolation. Norms are preserved and only relative angles appear, so models with RoPE tend to generalise to contexts longer than they were trained on more gracefully than added sinusoids or learned position tables.

The constant 10000 in \theta_i = b^{-2i/d} is the RoPE base b. It sets the slowest wavelength: how far apart two positions must be before their relative rotation in the slowest pair has wrapped all the way around, which is 2\pi\, b^{(d-2)/d} positions, a little under 2\pi b. Make b larger and every pair except the first spins more slowly (\theta_0 = 1 whatever the base), so the same number of distinct rotations stretches across a longer span of positions.
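To make the effect of the base concrete, the sketch below computes each pair's wavelength, 2\pi / \theta_i, for two bases. The function name rope_wavelengths, the dimension d = 128, and the larger base of 500000 are illustrative assumptions chosen for the comparison.

```python
import math


def rope_wavelengths(d, base=10000.0):
    """Positions pair i needs for one full turn: 2*pi / theta_i."""
    return [2.0 * math.pi / base ** (-2.0 * i / d) for i in range(d // 2)]


d = 128  # illustrative per-head dimension

small_base = rope_wavelengths(d, base=10000.0)
large_base = rope_wavelengths(d, base=500000.0)

fastest_small, slowest_small = small_base[0], small_base[-1]
fastest_large, slowest_large = large_base[0], large_base[-1]
```

The fastest pair has wavelength 2\pi under either base, because \theta_0 = 1 regardless of b. The slowest wavelength grows with the base: with b = 10000 and d = 128 it is a little over 54,000 positions, and it is considerably longer with the larger base.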

That is exactly the lever practitioners pull to extend a model's context window after training. NTK / base scaling raises b; position interpolation instead rescales the position itself, replacing m with m \cdot L_{\text{train}}/L_{\text{new}} so positions in a longer window fall back inside the range of angles seen during training. A few hundred fine-tuning steps then adapt the model, and a 4k-context model serves 32k. We'll meet long-context extension in its own right later — for now, notice that the whole trick is possible only because RoPE made the score a smooth function of the relative angle.
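The two levers can be compared side by side on the angles they produce. In the sketch below, the scale argument implements the position-interpolation rescaling m \cdot L_{\text{train}}/L_{\text{new}} from the text, while the raised base uses a plain factor of 8 purely for illustration; it is not a specific published scaling formula. The function name, d = 64, and the 4096 to 32768 window sizes are likewise assumptions for the example.

```python
def rope_angles(m, d, base=10000.0, scale=1.0):
    """Angles (m * scale) * theta_i; scale < 1 is position interpolation."""
    return [(m * scale) * base ** (-2.0 * i / d) for i in range(d // 2)]


d = 64
L_train, L_new = 4096, 32768
scale = L_train / L_new  # 1/8: squeeze the new window into the trained range

# Angles at the last position of the long window under three settings.
last_plain = rope_angles(L_new - 1, d)                        # no adjustment
last_interp = rope_angles(L_new - 1, d, scale=scale)          # position interpolation
last_rebased = rope_angles(L_new - 1, d, base=10000.0 * 8.0)  # hypothetical larger base

# Largest angle each pair could have reached during training.
trained_limit = rope_angles(L_train, d)
```

Under interpolation every pair's angle at the last new position stays below the corresponding entry of trained_limit, at the cost of packing neighbouring tokens eight times closer together in angle. Raising the base leaves the fastest pair untouched and slows the others, so the slow pairs stay closer to their trained range while short-range resolution is preserved.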

Watch a query spin with its position

The faint arrow is one query pair at position 0 — its reference orientation. Drive the position slider and the bold arrow turns by m\theta: same vector, rotated further the later it sits. Its length never changes, only its angle. Two tokens ten apart always differ by the same turn 10\,\theta, wherever they sit — and that fixed relative angle is the only thing attention's dot product can see.