Word Embeddings

We left off with a sentence as a sequence of token ids. Now each id must become a vector. The naïve choice is one-hot: token i becomes a length-V vector that is 1 in slot i and 0 everywhere else.

e_i = (\,0,\ \dots,\ 0,\ \underset{i}{1},\ 0,\ \dots,\ 0\,)^{\top} \in \mathbb{R}^{V}.

It works, but it is wasteful and dumb: the vectors are enormous (a real vocabulary has V \approx 50{,}000), almost entirely zero, and, fatally, every pair of distinct words is equally far apart (e_i^{\top} e_j = 0 and \lVert e_i - e_j \rVert = \sqrt{2} whenever i \neq j). "cat" is exactly as similar to "dog" as it is to "thursday". One-hot vectors carry an identity but no meaning.
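To make the "equally far apart" claim concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the helper names `one_hot`, `euclidean`, and `dot` are illustrative assumptions, not part of any library.

```python
import math

def one_hot(i, V):
    """Return a length-V list that is 1.0 in slot i and 0.0 elsewhere."""
    return [1.0 if j == i else 0.0 for j in range(V)]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy vocabulary; a real one would have tens of thousands of entries.
vocab = ["cat", "dog", "thursday"]
V = len(vocab)
vectors = {word: one_hot(i, V) for i, word in enumerate(vocab)}

cat_dog = euclidean(vectors["cat"], vectors["dog"])
cat_thursday = euclidean(vectors["cat"], vectors["thursday"])
cat_dot_dog = dot(vectors["cat"], vectors["dog"])

print(cat_dog == cat_thursday)  # the two distances are identical
print(cat_dot_dog)              # distinct one-hot vectors are orthogonal
```

Whatever pair of distinct words you pick, the distance and the dot product come out the same, so the representation cannot say that "cat" is closer to "dog" than to "thursday".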

An embedding is a learned lookup

Instead, give every word a short, dense vector of d numbers (say d = 50 or 300) and learn those numbers by training. Stack them as the rows of an embedding matrix E:

E \in \mathbb{R}^{V \times d}, \qquad \text{row } i = \text{the embedding of word } i.

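How do we fetch word i's embedding? Watch what happens when we multiply its one-hot vector by E^{\top}. (The transpose is needed for the shapes to match: E^{\top} is d \times V and e_i has length V.)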

Step 1 — write the matrix–vector product as a sum of columns. For a matrix M and vector v, (M v) is the linear combination \sum_j v_j\, M_{:,j}. Apply it to E^{\top} e_i:

E^{\top} e_i = \sum_{j=1}^{V} (e_i)_j \, (E^{\top})_{:,j} = \sum_{j=1}^{V} (e_i)_j \, (\text{row } j \text{ of } E).

Step 2 — use that the one-hot is zero everywhere but slot i. Every coefficient (e_i)_j is 0 except (e_i)_i = 1, so all terms vanish but one:

E^{\top} e_i = 1 \cdot (\text{row } i \text{ of } E) = E_{i,:}.

Step 3 — read off the punchline. Multiplying the embedding matrix by a one-hot vector simply selects row i. The "multiply" is a lookup in disguise:

\text{embed}(i) \;=\; E^{\top} e_i \;=\; E_{i,:} \;\in\; \mathbb{R}^{d}.
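The three steps can be checked numerically. The sketch below is plain Python under stated assumptions: the 4-by-3 matrix `E` holds made-up numbers, and `transpose_times` is a hypothetical helper that computes E^{\top} v exactly as in Step 1, as a sum of rows of E weighted by the entries of v.

```python
def one_hot(i, V):
    """Return a length-V list that is 1.0 in slot i and 0.0 elsewhere."""
    return [1.0 if j == i else 0.0 for j in range(V)]

def transpose_times(E, v):
    """Compute E^T v as the sum over j of v[j] * (row j of E)."""
    d = len(E[0])
    out = [0.0] * d
    for v_j, row in zip(v, E):
        for k in range(d):
            out[k] += v_j * row[k]
    return out

# Hypothetical embedding matrix with V = 4 words and d = 3 dimensions.
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

i = 2
via_multiply = transpose_times(E, one_hot(i, len(E)))  # V * d multiply-adds
via_lookup = E[i]                                      # a single row index

print(via_multiply == via_lookup)  # True
```

The multiply touches every entry of E to produce a result that indexing reads off directly, which is why the lookup is the form used in practice.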

In practice nobody forms the giant one-hot vector or does the multiply; they just index row i directly. But seeing it as a matrix multiply is what makes the embedding a normal, trainable layer: whenever word i appears, gradients flow into row i and only row i, and over many updates the training objective nudges words used in similar contexts toward each other.
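Why only row i receives a gradient follows from the product form: if a loss L depends on E only through h = E^{\top} e_i, then \partial L / \partial E = e_i g^{\top} with g = \partial L / \partial h, a matrix that is zero outside row i. The sketch below illustrates this with an assumed toy loss, L = \tfrac{1}{2} \lVert E_{i,:} - \text{target} \rVert^2; the loss, the `target` vector, the learning rate, and the name `sgd_step` are all hypothetical choices for illustration, not how any particular model is trained.

```python
import copy

def sgd_step(E, i, target, lr=0.1):
    """One gradient step on L = 0.5 * ||E[i] - target||^2 (an assumed toy loss).

    The gradient with respect to E is e_i g^T with g = E[i] - target,
    which is zero in every row except row i, so only that row is updated.
    """
    g = [h - t for h, t in zip(E[i], target)]
    E[i] = [h - lr * g_k for h, g_k in zip(E[i], g)]
    return g

# Hypothetical embedding matrix with V = 3 words and d = 2 dimensions.
E = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
before = copy.deepcopy(E)

sgd_step(E, i=1, target=[3.0, -1.0])

changed = [row != old for row, old in zip(E, before)]
print(changed)  # [False, True, False]
```

Rows for words that did not appear in the example are left exactly as they were.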

Meaning becomes geometry

Because the rows are trained, words that play similar roles drift to nearby points, and directions in the space pick up meaning. The famous example: the vector from man to woman is roughly the same as the one from king to queen, so

\text{king} - \text{man} + \text{woman} \;\approx\; \text{queen}.

Analogies become vector arithmetic; similarity becomes closeness in the embedding space. That is the whole payoff of trading sparse identity for dense, learned geometry.

An embedding layer maps token ids to dense vectors via an embedding matrix E \in \mathbb{R}^{V \times d}.

The first embeddings to make a splash came from word2vec (2013), which never looked at a dictionary. It learned each vector by a single, almost trivial pretext task: predict a word from the words around it (or vice versa). Sliding that window over billions of words of text, it nudged the embedding of each word toward the company it keeps — and out fell vectors in which meaning is geometry.
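The "words around it" idea can be sketched as a sliding window that emits (center, context) pairs. This shows only how such training pairs might be extracted; it is not word2vec's actual training procedure, and the function name `context_pairs`, the window size, and the example sentence are assumptions made for illustration.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for pos, center in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:  # a word is not its own context
                pairs.append((center, tokens[ctx_pos]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = context_pairs(sentence, window=1)

print(pairs[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Each pair is one small prediction problem: given the center word, predict the context word, or the other way around.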

The analogy \text{king} - \text{man} + \text{woman} \approx \text{queen} is a parallelogram in the embedding space: the arrow \text{man} \to \text{king} (loosely, "royalty") is parallel to the arrow \text{woman} \to \text{queen}. Starting at king, subtracting the "maleness" direction and adding "femaleness" lands you next to queen — the fourth corner of the parallelogram. The interactive below lets you walk a couple of these arrows yourself.

Walk an analogy

A toy 2-D embedding of six words. Pick an analogy and the figure draws the arithmetic: start at the first word, subtract the second, add the third — the dashed result vector lands right next to the answer. The two solid arrows are parallel: that parallelism is the analogy.
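For readers without the figure, the same arithmetic can be sketched in a few lines of Python. The six words and their 2-D coordinates below are invented for illustration (x loosely "royalty", y loosely "gender") and are not taken from the figure or from any trained model; the function name `analogy` is likewise hypothetical. Nearness here is plain Euclidean distance, whereas analogy evaluations on real embeddings usually rank candidates by cosine similarity.

```python
import math

# Hypothetical 2-D coordinates: x is loosely "royalty", y is loosely "gender".
EMBEDDINGS = {
    "man": (1.0, 1.0),
    "woman": (1.0, 3.0),
    "king": (4.0, 1.2),
    "queen": (4.1, 3.1),
    "prince": (3.0, 1.0),
    "princess": (3.1, 2.9),
}

def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word nearest to a - b + c, excluding the three input words."""
    target = tuple(
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    )
    candidates = (w for w in embeddings if w not in (a, b, c))
    return min(candidates, key=lambda w: math.dist(embeddings[w], target))

print(analogy("king", "man", "woman"))    # queen
print(analogy("prince", "man", "woman"))  # princess
```

Excluding the three input words from the candidates matters: without that step the nearest point to the result is often one of the inputs.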