Rotate, don't add — line by line
Work in one pair of coordinates at a time. Take a query vector
q \in \mathbb{R}^d and group its coordinates into 2-D pairs
(q_0, q_1), (q_2, q_3), \dots — exactly the pairing the sinusoidal
code used, now treated as little plane vectors.
Step 1 — pick an angle that grows with position. Pair
i (with i = 0, 1, \dots, d/2 - 1, covering coordinates q_{2i} and q_{2i+1}) gets a fixed base frequency
\theta_i = 10000^{-2i/d} (fast pairs spin quickly, slow pairs barely
move — same geometric spread as before). A token at position m turns
that pair by the angle
m\,\theta_i.
Position scales the rotation: token 0 is unrotated, and token
m is rotated m times as far as token 1.
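A quick way to get a feel for Step 1 is to tabulate the frequencies and angles. The sketch below uses plain Python and a toy dimension d = 8; the function names rope_frequencies and rope_angles are made up for illustration and do not come from any particular library.

```python
import math

def rope_frequencies(d, base=10000.0):
    """theta_i = base**(-2i/d) for each of the d/2 coordinate pairs."""
    assert d % 2 == 0, "d must be even so coordinates can be paired"
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def rope_angles(m, d, base=10000.0):
    """Angle m * theta_i that a token at position m applies to pair i."""
    return [m * theta for theta in rope_frequencies(d, base)]

thetas = rope_frequencies(8)      # geometric spread from fast to slow pairs
angles_at_0 = rope_angles(0, 8)   # position 0: every angle is zero
angles_at_3 = rope_angles(3, 8)   # position 3: three times the base angles
```

With d = 8 the frequencies fall off by a constant factor from one pair to the next, which is the geometric spread described above.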
Step 2 — apply a 2-D
rotation to the pair. Rotating the pair at position m uses the
ordinary rotation matrix
R_m^{(i)} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}, \qquad \tilde{q}^{(i)} = R_m^{(i)}\,q^{(i)}.
Stacking these 2\times 2 blocks down the diagonal gives one big
block-diagonal rotation R_m acting on the whole vector. The key at
position n is rotated the same way:
\tilde{k} = R_n\,k. Crucially the length of every pair is
untouched — a rotation only turns.
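To make Step 2 concrete, here is a minimal pure-Python sketch that applies the block-diagonal rotation pair by pair and checks that each pair keeps its length. The helper name rope_rotate and the toy vector are assumptions for illustration; real implementations vectorise this and may lay out the pairs differently.

```python
import math

def rope_rotate(x, m, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle m * theta_i."""
    d = len(x)
    assert d % 2 == 0, "d must be even so coordinates can be paired"
    out = []
    for i in range(d // 2):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        # Ordinary 2x2 rotation matrix applied to the pair (a, b).
        out += [c * a - s * b, s * a + c * b]
    return out

q = [1.0, 2.0, -0.5, 0.3]
q_rot = rope_rotate(q, m=5)

# Length of every pair before and after the rotation.
norms_before = [math.hypot(q[2 * i], q[2 * i + 1]) for i in range(2)]
norms_after = [math.hypot(q_rot[2 * i], q_rot[2 * i + 1]) for i in range(2)]
```

The two lists of norms agree up to floating-point rounding, which is the "a rotation only turns" property in numbers.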
Step 3 — form the attention score. Attention compares query and key by a
dot product. With RoPE the score between a query at position m and a key at
position n is
\langle \tilde{q}, \tilde{k}\rangle = (R_m q)^{\top}(R_n k) = q^{\top} R_m^{\top} R_n\, k.
Step 4 — rotations compose. A rotation's inverse is its transpose, and turning
by -m\theta_i then +n\theta_i is a single
turn by (n-m)\theta_i:
R_m^{\top} R_n = R_{-m}\,R_n = R_{\,n-m}.
The two absolute positions m and n have
collapsed into the single relative offset n-m.
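The composition rule in Step 4 is easy to check numerically on a single 2\times 2 block. The sketch below is an illustration only: the helpers rot, transpose and matmul are hand-rolled for the 2-D case, and the values of theta, m and n are arbitrary.

```python
import math

def rot(angle):
    """2x2 rotation matrix for the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]

def transpose(a):
    return [[a[0][0], a[1][0]], [a[0][1], a[1][1]]]

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

theta = 0.1      # one pair's base frequency (arbitrary choice)
m, n = 7, 12     # two absolute positions (arbitrary choice)

lhs = matmul(transpose(rot(m * theta)), rot(n * theta))  # R_m^T R_n
rhs = rot((n - m) * theta)                               # R_{n-m}
max_gap = max(abs(lhs[r][c] - rhs[r][c]) for r in range(2) for c in range(2))
```

max_gap comes out at the level of floating-point rounding, so the product really does depend on m and n only through n - m.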
Step 5 — read off the punchline. Substituting back,
\langle \tilde{q}, \tilde{k}\rangle = q^{\top} R_{\,n-m}\, k = g(q, k,\; n-m).
The attention score depends on q, k, and
only the gap n-m — never on where the pair sits in
absolute terms. Slide both tokens ten positions to the right and their score is identical. The
model sees relative position, baked structurally into the dot product rather than added
on the side.
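The shift invariance can be seen directly by computing the score at two different absolute placements with the same gap. This is a small pure-Python sketch under the same assumptions as the steps above; rope_rotate and rope_score are illustrative names, and the vectors are arbitrary.

```python
import math

def rope_rotate(x, m, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle m * theta_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out += [c * a - s * b, s * a + c * b]
    return out

def rope_score(q, k, m, n):
    """Dot product of the query rotated to position m and the key rotated to n."""
    return sum(a * b for a, b in zip(rope_rotate(q, m), rope_rotate(k, n)))

q = [1.0, 0.0, 0.5, -1.0]
k = [0.3, 0.8, -0.2, 0.4]

score_a = rope_score(q, k, m=3, n=8)     # gap of 5
score_b = rope_score(q, k, m=13, n=18)   # same gap, both shifted right by ten
score_c = rope_score(q, k, m=3, n=9)     # gap of 6
```

score_a and score_b match up to rounding because the gap is the same, while score_c differs because the gap changed.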
Step 6 — why it extrapolates. Because only the offset matters and each pair's
rotation is a smooth periodic function of that offset, a model trained on short contexts has
already seen every small relative angle. Feeding it a longer sequence reuses those
same relative rotations for nearby tokens and adds only larger offsets at the far end, so RoPE tends to generalise to longer contexts
more gracefully than a learned per-slot table or a fixed additive code.
RoPE encodes position by rotating queries and keys rather than adding a code:
- Rotate by m\theta. Pair the coordinates; a token
  at position m rotates pair i by
  m\theta_i with \theta_i = 10000^{-2i/d},
  i.e. \tilde{q} = R_m q, \tilde{k} = R_n k.
- Score depends only on n-m. Since
  R_m^{\top} R_n = R_{\,n-m}, the dot product
  \langle R_m q, R_n k\rangle = q^{\top} R_{\,n-m}\, k is a function
  of the relative offset alone.
- Better length extrapolation. Norms are preserved and only relative angles
  appear, so models with RoPE generalise to contexts longer than they were trained on far more
  gracefully than added sinusoids or learned position tables.
The constant 10000 in \theta_i = b^{-2i/d}
is the RoPE base b. It sets the slowest wavelength —
how far apart two positions must be before their relative rotation has wrapped all the way
around. Make b larger and every pair except the first (where \theta_0 = 1 regardless of b) spins more slowly, so the same
number of distinct rotations stretches across a longer span of positions.
That is exactly the lever practitioners pull to extend a model's context window after training.
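To see what the base does, it helps to list each pair's wavelength, taken here as 2\pi/\theta_i, the number of positions needed for a full turn. The sketch below compares the default base with a larger one; the value 500000 and the helper name rope_wavelengths are illustrative assumptions, not a recommendation.

```python
import math

def rope_wavelengths(d, base):
    """Positions needed for pair i to complete one full turn: 2*pi / theta_i."""
    return [2.0 * math.pi / base ** (-2.0 * i / d) for i in range(d // 2)]

d = 8
wl_small_base = rope_wavelengths(d, base=10000.0)
wl_large_base = rope_wavelengths(d, base=500000.0)  # hypothetical larger base

for i, (a, b) in enumerate(zip(wl_small_base, wl_large_base)):
    print(f"pair {i}: wavelength {a:.1f} -> {b:.1f} positions")
```

Pair 0 is unchanged because \theta_0 = 1 for any base, while every later pair gets a longer wavelength, with the last pair stretching the most.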
NTK / base scaling raises b; position
interpolation instead rescales the position itself, replacing m
with m \cdot L_{\text{train}}/L_{\text{new}} so positions in a longer
window fall back inside the range of angles seen during training. A few hundred fine-tuning steps
then adapt the model, and a 4k-context model serves 32k. We'll meet long-context
extension in its own right later — for now, notice that the whole trick is possible
only because RoPE made the score a smooth function of the relative angle.
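Position interpolation as described above amounts to one extra multiplication before the angle is computed. The following is a minimal sketch, assuming a 4096-token training window stretched to 32768; the scale parameter and the function name rope_angles are hypothetical and chosen for illustration.

```python
import math

def rope_angles(m, d, base=10000.0, scale=1.0):
    """Angles (m * scale) * theta_i; scale < 1 squeezes positions together."""
    return [m * scale * base ** (-2.0 * i / d) for i in range(d // 2)]

L_train, L_new = 4096, 32768
scale = L_train / L_new   # 0.125: eight new positions per trained position

# Rescaled positions for a few tokens in the longer window.
scaled_positions = [m * scale for m in (0, L_new // 2, L_new - 1)]

# Position 8 in the long window lands on the angles of trained position 1.
interp_angles = rope_angles(8, d=8, scale=scale)
trained_angles = rope_angles(1, d=8)
```

Every rescaled position stays below L_train, so the angles fall inside the range seen in training, at the cost of neighbouring tokens being packed eight times closer in angle, which is what the fine-tuning has to absorb.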