Rotary Position Embeddings

A transformer's attention is permutation-blind — on its own it has no idea which token came first. The original fix was to add a positional encoding vector to each token. Rotary Position Embeddings (RoPE) take a cleverer route: instead of adding position information, they rotate each query and key vector by an angle that depends on its position. It's the position scheme behind most modern large language models — LLaMA, GPT-NeoX, PaLM and many more.

Encode position as a rotation

Split each query and key vector into pairs of coordinates, and treat each pair as a little 2-D vector. RoPE rotates the pair at position m by an angle m\theta — proportional to how far along the sequence the token sits (with different pairs rotating at different frequencies \theta, like the hands of many clocks turning at different speeds).

The magic appears when attention takes a dot product between a query at position m and a key at position n. Because rotating one vector by m\theta and the other by n\theta leaves a dot product that depends only on the difference of the angles, the attention score ends up depending on the relative position m - n — how far apart the tokens are — rather than their absolute places. Relative position, which is what language actually cares about, falls out for free.

Encodes position by rotating query/key vectors by an angle proportional to position — no vector is added.
The resulting attention score depends on relative position m - n, not absolute position.
It's applied to queries and keys (not values), and extrapolates gracefully to longer sequences than seen in training.

An added positional vector fixes an absolute slot ("you are token 5"), and a model trained only up to length 2048 has never seen "token 4000", so it struggles to extrapolate. A rotation instead bakes in relative distance ("you are 3 apart"), which is the same whether the pair sits at positions 5–8 or 4000–4003. That relativity is why RoPE-based models generalise to longer contexts so much more gracefully, and why techniques for stretching context further (like interpolating the rotation frequencies) plug into it so naturally.

RoPE rotates the queries and keys — the things whose dot product forms the attention scores — not the value vectors.
It carries no learned parameters: the rotation angles are fixed by position and frequency. Its information lives in the relative angle between tokens, not an absolute code you can read off a single vector.