Word Embeddings
We left off with a sentence as a sequence of token ids. Now each id must become a vector.
The naïve choice is one-hot: token i becomes a length-V vector that is 1 in slot i and 0
everywhere else.
e_i = (\,0,\ \dots,\ 0,\ \underset{i}{1},\ 0,\ \dots,\ 0\,)^{\top} \in \mathbb{R}^{V}.
It works, but it is wasteful and dumb: the vectors are enormous (a real vocabulary can have
V \approx 50{,}000), almost entirely zero, and, fatally, every pair of distinct words
is equally far apart: always orthogonal, always at Euclidean distance \sqrt{2}. "cat" is
exactly as similar to "dog" as it is to "thursday". One-hot
vectors carry an identity but no meaning.
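To see the "equally far apart" problem concretely, here is a minimal NumPy sketch. The vocabulary size of 5 is an arbitrary toy value picked for illustration; nothing about the result depends on it.

```python
import numpy as np

V = 5                      # toy vocabulary size (assumption; real ones are far larger)
one_hot = np.eye(V)        # row i is the one-hot vector e_i

# Euclidean distance between every pair of one-hot vectors.
diff = one_hot[:, None, :] - one_hot[None, :, :]
dist = np.linalg.norm(diff, axis=-1)

# Dot product between every pair (zero means orthogonal).
dots = one_hot @ one_hot.T

off_diag = ~np.eye(V, dtype=bool)                 # mask selecting distinct pairs only
print(np.allclose(dist[off_diag], np.sqrt(2)))    # all distinct pairs sit at sqrt(2)
print(np.allclose(dots[off_diag], 0.0))           # all distinct pairs are orthogonal
```

No pair is closer than any other, so no notion of similarity can be read off these vectors.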
An embedding is a learned lookup
Instead, give every word a short, dense vector of d numbers (say
d = 50 or 300) and learn
those numbers by training. Stack them as the rows of an embedding matrix
E:
E \in \mathbb{R}^{V \times d}, \qquad \text{row } i = \text{the embedding of word } i.
How do we fetch word i's embedding? Watch what happens when we multiply E^{\top} by
its one-hot vector e_i.
Step 1 — write the matrix–vector product as a sum of columns. For a matrix
M and vector v, the product M v is the linear combination
\sum_j v_j\, M_{:,j} of the columns of M. Apply it with M = E^{\top}, whose columns are the
rows of E, and v = e_i:
E^{\top} e_i = \sum_{j=1}^{V} (e_i)_j \, (E^{\top})_{:,j} = \sum_{j=1}^{V} (e_i)_j \, (\text{row } j \text{ of } E).
Step 2 — use that the one-hot is zero everywhere but slot i.
Every coefficient (e_i)_j is 0 except
(e_i)_i = 1, so all terms vanish but one:
E^{\top} e_i = 1 \cdot (\text{row } i \text{ of } E) = E_{i,:}.
Step 3 — read off the punchline. Multiplying the embedding matrix by a one-hot
vector e_i (here in the form E^{\top} e_i) simply selects row i of E. The "multiply" is a
lookup in disguise:
\text{embed}(i) \;=\; E^{\top} e_i \;=\; E_{i,:} \;\in\; \mathbb{R}^{d}.
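The identity above is easy to check numerically. The following is a small sketch with a randomly filled E; the sizes V = 6 and d = 3 and the token id 4 are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3                     # toy sizes, chosen only for illustration
E = rng.normal(size=(V, d))     # embedding matrix: one row per token id

i = 4                           # an arbitrary token id
e_i = np.zeros(V)
e_i[i] = 1.0                    # one-hot vector for token i

via_matmul = E.T @ e_i          # the "multiply" view: E^T e_i
via_lookup = E[i]               # the "lookup" view: row i of E

print(np.allclose(via_matmul, via_lookup))

# A whole sequence of ids can be embedded with a single index operation.
ids = np.array([4, 0, 2])
sentence = E[ids]               # shape (3, d): one embedding per token
```

The indexed version does the same job without ever touching the V - 1 zero entries of the one-hot vector.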
In practice nobody forms the giant one-hot vector or does the multiply; they just index row
i directly. But seeing it as a matrix multiply is what makes
the embedding a normal, trainable layer: gradients flow into row i
whenever word i appears, so words that occur in similar contexts tend to receive similar
updates and end up with similar vectors.
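The claim that gradients reach only row i follows from the chain rule applied to E^{\top} e_i. Here is a hand-computed sketch in NumPy; the squared-error loss and the `target` vector are hypothetical stand-ins for whatever training signal a real model would supply.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3                       # toy sizes (assumption)
E = rng.normal(size=(V, d))

i = 2                             # the token that appears in this training example
e_i = np.zeros(V)
e_i[i] = 1.0
target = rng.normal(size=d)       # hypothetical training signal

h = E.T @ e_i                     # forward pass: the embedding of token i
grad_h = h - target               # dL/dh for the toy loss L = 0.5 * ||h - target||^2
grad_E = np.outer(e_i, grad_h)    # chain rule: dL/dE = e_i grad_h^T

# Only the row of the token that appeared has a nonzero gradient.
rows_touched = np.flatnonzero(np.abs(grad_E).sum(axis=1) > 0)
print(rows_touched)

# One gradient-descent step therefore changes row i and nothing else.
E_new = E - 0.1 * grad_E
```

Because the gradient is an outer product with a one-hot vector, it is zero in every row except i, which is why embedding layers are usually updated sparsely.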
Meaning becomes geometry
Because the rows are trained, words that play similar roles drift to nearby points, and
directions in the space pick up meaning. The famous example: the vector from
man to woman is roughly the same as the one from king to
queen, so
\text{king} - \text{man} + \text{woman} \;\approx\; \text{queen}.
Analogies become vector arithmetic; similarity becomes
closeness in the
embedding space. That is the whole payoff of trading sparse identity for dense, learned geometry.
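As a sketch of how such an analogy is typically evaluated, the snippet below does the arithmetic and picks the nearest remaining word by cosine similarity. The 2-D coordinates are invented by hand for illustration; they are not trained embeddings. The three input words are excluded from the candidates, a common convention, since one of them would otherwise often win.

```python
import numpy as np

# Hand-picked 2-D vectors, invented for illustration (not trained embeddings).
vocab = {
    "man":      np.array([1.0, 1.0]),
    "woman":    np.array([1.0, 3.0]),
    "king":     np.array([3.0, 1.0]),
    "queen":    np.array([3.1, 2.9]),
    "apple":    np.array([-2.0, 0.5]),
    "thursday": np.array([-1.0, -2.0]),
}

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Return the word closest to vocab[a] - vocab[b] + vocab[c], excluding a, b, c."""
    query = vocab[a] - vocab[b] + vocab[c]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vocab[w]))

print(analogy("king", "man", "woman"))  # queen
```

With these toy coordinates the query lands at (3, 3), right beside the hand-placed "queen" at (3.1, 2.9).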
An embedding layer maps token ids to dense vectors via an embedding matrix E \in \mathbb{R}^{V \times d}.
- One-hot × E is a row lookup. E^{\top} e_i = E_{i,:}: multiplying by the one-hot just selects row i, so the layer is an indexable, trainable table.
- Dense beats sparse. One-hot vectors have length V and are almost all zeros; embeddings have length d \ll V and every entry is informative.
- Similarity is geometry. Training places related words near each other, so meaning shows up as distance and direction, letting analogies be solved by vector arithmetic, \text{king} - \text{man} + \text{woman} \approx \text{queen}.
The first embeddings to make a splash came from word2vec (2013), which never
looked at a dictionary. It learned each vector by a single, almost trivial pretext task:
predict a word from the words around it (or vice versa). Sliding that window over billions of
words of text, it nudged the embedding of each word toward the company it keeps — and out
fell vectors in which meaning is geometry.
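To make the pretext task concrete, here is a sketch of how training pairs could be cut from raw text for the "predict the surrounding words from a word" direction (the variant usually called skip-gram). The function name `skipgram_pairs`, the window size, and the example sentence are all illustrative assumptions, and the sketch covers only pair generation, not the training itself.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for pos, center in enumerate(tokens):
        lo = max(0, pos - window)                  # clip the window at the sentence start
        hi = min(len(tokens), pos + window + 1)    # and at the sentence end
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:                     # a word is not its own context
                pairs.append((center, tokens[ctx_pos]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs[:4]:
    print(center, "->", context)
```

Each pair is one tiny prediction problem; a model trained on enough of them ends up placing words with similar neighbours close together.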
The analogy \text{king} - \text{man} + \text{woman} \approx \text{queen}
is a parallelogram in the embedding space: the arrow
\text{man} \to \text{king} (loosely, "royalty") is parallel to the
arrow \text{woman} \to \text{queen}. Starting at
king, subtracting the "maleness" direction and adding "femaleness" lands you next to
queen — the fourth corner of the parallelogram. The interactive below lets you walk a
couple of these arrows yourself.
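The parallelogram reading is the analogy equation rearranged. As a worked step, treating the approximation as loosely as the text does:

```latex
\text{king} - \text{man} + \text{woman} \approx \text{queen}
\quad\Longleftrightarrow\quad
\underbrace{\text{king} - \text{man}}_{\text{arrow } \text{man} \to \text{king}}
\;\approx\;
\underbrace{\text{queen} - \text{woman}}_{\text{arrow } \text{woman} \to \text{queen}}
```

Two arrows that are equal as vectors are parallel and of equal length, which is what makes man, king, queen, and woman the corners of an approximate parallelogram.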
Walk an analogy
A toy 2-D embedding of six words. Pick an analogy and the figure draws the arithmetic:
start at the first word, subtract the second, add the third — the dashed result vector lands
right next to the answer. The two solid arrows are parallel: that parallelism
is the analogy.