Encode position as a rotation
Split each query and key vector into pairs of coordinates, and treat each pair as a little 2-D
vector. RoPE rotates the pair at position $m$ by an angle $m\theta$, proportional to how far along
the sequence the token sits. Different pairs rotate at different frequencies $\theta$, like the
hands of many clocks turning at different speeds.
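The pairwise rotation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `rope_rotate`, the choice to pair adjacent coordinates, and the frequency schedule `base ** (-2i/d)` with `base = 10000` are assumptions made for the example (that schedule is a common convention, but the text above does not specify one).

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Rotate consecutive coordinate pairs of x by position * theta_i.

    Hypothetical helper for illustration; pairs are (x[0], x[1]), (x[2], x[3]), ...
    """
    x = np.asarray(x, dtype=np.float64)
    d = x.shape[-1]
    assert d % 2 == 0, "dimension must be even to form pairs"
    # One frequency per pair (assumed schedule: theta_i = base^(-2i/d)).
    theta = base ** (-np.arange(0, d, 2) / d)
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    # Standard 2-D rotation applied to every pair independently.
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])
q_rot = rope_rotate(q, position=3)  # first pair turns fast, second pair slowly
```

Because each pair is only rotated, the length of the vector is unchanged; position 0 leaves the vector exactly as it was.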
The magic appears when attention takes a dot product between a query at position $m$ and a key at
position $n$. Rotating one vector by $m\theta$ and the other by $n\theta$ leaves a dot product that
depends only on the difference of the angles, so the attention score depends on the relative
position $m - n$ (how far apart the tokens are) rather than on their absolute places. Relative
position, which is usually what language cares about, falls out for free.
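For a single coordinate pair, the cancellation can be written out explicitly. The notation is introduced here for illustration: $R(\alpha)$ is the 2-D rotation matrix for angle $\alpha$, and $q$, $k$ are the 2-D query and key pairs before rotation.

```latex
\begin{aligned}
\langle R(m\theta)\,q,\; R(n\theta)\,k \rangle
  &= q^\top R(m\theta)^\top R(n\theta)\,k \\
  &= q^\top R(-m\theta)\,R(n\theta)\,k \\
  &= q^\top R\big((n-m)\theta\big)\,k .
\end{aligned}
```

The positions $m$ and $n$ appear only through their difference, and summing over all pairs gives the full attention score the same property.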
- Encodes position by rotating query/key vectors by an angle proportional to
position — no vector is added.
- The resulting attention score depends on relative position $m - n$, not absolute position.
- It's applied to queries and keys (not values), and tends to extend to sequences longer than
those seen in training more gracefully than added absolute position vectors do.
An added positional vector fixes an absolute slot ("you are token 5"), and a model
trained only up to length 2048 has never seen "token 4000", so it struggles to extrapolate. A
rotation instead bakes in relative distance ("you are 3 apart"), which is the same whether
the pair sits at positions 5–8 or 4000–4003. That relativity helps RoPE-based models generalise
to longer contexts more gracefully, and it is also why techniques for stretching context further
(like interpolating the rotation frequencies) plug into it so naturally.
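The 5–8 versus 4000–4003 claim is easy to check numerically. The sketch below is self-contained and makes the same assumptions as any toy RoPE: a hypothetical `rope_rotate` helper, adjacent-coordinate pairing, and a `base = 10000` frequency schedule. The `scale` argument illustrates one simple way of stretching context, compressing positions by a constant factor (which is equivalent to scaling every rotation frequency); it is an assumed variant, not necessarily the scheme a given model uses.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Rotate consecutive coordinate pairs of x by position * theta_i (toy helper)."""
    x = np.asarray(x, dtype=np.float64)
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # assumed frequency schedule
    cos, sin = np.cos(position * theta), np.sin(position * theta)
    out = np.empty_like(x)
    out[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    out[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

def score(m, n, scale=1.0):
    # scale < 1 squeezes positions together, i.e. slows every rotation down.
    return float(rope_rotate(q, m * scale) @ rope_rotate(k, n * scale))

near = score(5, 8)        # query at 5, key at 8
far = score(4000, 4003)   # same offset, far along the sequence
print(np.isclose(near, far))

# With scale = 0.5, position 4000 is rotated exactly as position 2000 would be,
# so a model trained up to 2048 sees angles inside its familiar range.
squeezed = score(4000, 4003, scale=0.5)
```

The two scores agree up to floating-point error because only the offset of 3 enters the dot product.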
- RoPE rotates the queries and keys (the things whose dot product forms the attention scores),
not the value vectors.
- It carries no learned parameters: the rotation angles are fixed by position and frequency. Its
information lives in the relative angle between tokens, not an absolute code you can read off a
single vector.
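Both points can be seen in a toy single-head attention. This is a sketch under stated assumptions: the helper names `rope_rotate` and `rope_attention`, the `base = 10000` schedule, and the absence of a causal mask are all choices made for the example. Note that the rotation function holds no trainable state, and that `v` is passed through untouched.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """x: (seq, d), positions: (seq,). Rotates each row by its own position."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)    # fixed by formula, nothing learned
    angle = positions[:, None] * theta[None, :]  # shape (seq, d/2)
    cos, sin = np.cos(angle), np.sin(angle)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def rope_attention(q, k, v, positions):
    # Rotate queries and keys only; values are left as they are.
    q_rot = rope_rotate(q, positions)
    k_rot = rope_rotate(k, positions)
    scores = q_rot @ k_rot.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out_a = rope_attention(q, k, v, np.arange(4))
out_b = rope_attention(q, k, v, np.arange(4) + 100)  # every token shifted by 100
```

Shifting every token by the same amount leaves the output unchanged, which is the "no absolute code" property in action: only the angles between tokens matter.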