The Softmax Function
A classifier's final layer spits out a vector of raw real scores — logits —
one per class. They can be any size, positive or negative, and they don't add up to anything in
particular. To turn them into a usable answer we need a probability distribution:
all non-negative, summing to 1. The softmax function is the universal "pick one of many" converter
that does exactly this, and it sits at the end of nearly every multiclass classifier and, as we'll
see, inside attention too.
From scores to a distribution
Given logits z = (z_1, \dots, z_K), softmax is:
\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}.
Step 1 — exponentiate. Apply e^{z_i} to each
logit. Since e^{x} > 0 for every real
x, every entry is now strictly positive — the
first requirement of a probability.
Step 2 — normalise. Divide each by the sum of them all. The exponential also
amplifies differences: a logit a little larger than the rest becomes a probability a
lot larger — softmax is a soft, differentiable version of "take the max".
Step 3 — check it sums to 1. Add up all
K outputs; the common denominator is the sum of the numerators, so
it cancels:
\sum_{i=1}^{K} \operatorname{softmax}(z)_i = \frac{\sum_{i} e^{z_i}}{\sum_{j} e^{z_j}} = 1.
Positive and summing to one: a genuine probability distribution, by construction.
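The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `softmax_naive` and the example logits are assumptions made for the sketch, and it deliberately transcribes the formula as written, without any safeguard against large logits.

```python
import math

def softmax_naive(logits):
    """Direct transcription of the formula: exponentiate, then normalise."""
    exps = [math.exp(z) for z in logits]   # Step 1: every entry becomes > 0
    total = sum(exps)                      # the shared denominator
    return [e / total for e in exps]       # Step 2: divide each by the sum

# Hypothetical logits, chosen only for illustration.
probs = softmax_naive([2.0, 1.0, -1.0])
print(probs)
print(sum(probs))  # Step 3: equal to 1 up to floating-point rounding
```

The largest logit receives the largest probability, and the ordering of the logits is preserved in the output.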
The numerically stable form
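On a real computer e^{z_i} overflows once a logit gets moderately large: in 64-bit floating point
e^{z} exceeds the largest representable number for z above roughly 709, and in 32-bit for z above
roughly 88, so e^{1000} is \infty either way. The fix is a one-line trick that leaves the value mathematically unchanged.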
Step 1 — subtract a constant from every logit. Let
m = \max_j z_j and multiply top and bottom by
e^{-m}:
\frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{e^{z_i} e^{-m}}{\sum_j e^{z_j} e^{-m}} = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}}.
Step 2 — see why it's safe. The common factor
e^{-m} cancels, so the result is identical — but now the
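largest exponent is \max_i (z_i - m) = 0, so every
e^{z_i - m} \le 1. Overflow is impossible, and terms that underflow to 0 are harmless, because the
largest term is exactly 1 and keeps the denominator at least 1. This is the form numerical libraries typically compute.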
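A small sketch of the max-subtraction trick, using only the standard library. The name `softmax_stable` and the logits are hypothetical. Note one Python-specific detail: `math.exp` raises `OverflowError` on overflow instead of returning infinity, so the failure of the naive form shows up here as an exception.

```python
import math

def softmax_stable(logits):
    """Softmax with the maximum subtracted before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # every exponent is <= 0
    total = sum(exps)                         # >= 1: the max term is exp(0) = 1
    return [e / total for e in exps]

big = [1000.0, 1001.0, 1002.0]  # hypothetical logits, too large for exp()

try:
    math.exp(big[0])            # the numerator the naive formula would need
    naive_overflowed = False
except OverflowError:           # math.exp raises rather than returning inf
    naive_overflowed = True

print(naive_overflowed)
print(softmax_stable(big))
print(softmax_stable([0.0, 1.0, 2.0]))  # same gaps between logits, same result
```

Because only the differences z_i - m enter the computation, shifting every logit by the same amount leaves the output unchanged, which is why the last two printed lists agree.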
Temperature: sharpen or flatten
Divide the logits by a temperature T > 0 before
the softmax:
\operatorname{softmax}(z/T)_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}.
Step 1 — cool it down, T \to 0. Dividing by a tiny
T blows up the gaps between logits, so the largest one runs away and
the distribution collapses onto it: softmax sharpens to the argmax (a hard
one-hot pick, or an even split if several logits tie for the maximum).
Step 2 — heat it up, T \to \infty. Dividing by a
huge T crushes all the logits toward
0, so every e^{z_i/T} \to 1 and the
distribution flattens to uniform, 1/K each.
Temperature is the knob between "decisive" and "undecided". At T = 1
you get plain softmax.
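The two limits can be checked numerically. The following is a sketch under stated assumptions: `softmax_t` is a hypothetical helper name, the logits are made up, and the temperatures 0.01 and 1000 merely stand in for "T near 0" and "T very large".

```python
import math

def softmax_t(logits, temperature=1.0):
    """Softmax of logits / temperature, in the max-subtracted form."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 0.5]          # hypothetical logits

cold = softmax_t(logits, 0.01)    # small T: mass piles onto the largest logit
plain = softmax_t(logits, 1.0)    # T = 1: ordinary softmax
hot = softmax_t(logits, 1000.0)   # large T: close to uniform, 1/3 each

for name, p in [("cold", cold), ("plain", plain), ("hot", hot)]:
    print(name, [round(x, 4) for x in p])
```

Dividing by the temperature before subtracting the maximum keeps the cold case safe: the scaled logits can be very large, but the exponents passed to `math.exp` are never positive.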
The gradient that makes training easy
Here is the payoff that makes softmax the default output. Pair it with
cross-entropy
loss against a one-hot target y, write
p = \operatorname{softmax}(z), and the gradient of the loss with
respect to the logits is breathtakingly simple.
Step 1 — the combined loss. With true class
y one-hot, cross-entropy is
\mathcal{L} = -\sum_k y_k \log p_k.
Step 2 — differentiate through softmax. The algebra (the
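e^{z_i} in both numerator and denominator) simplifies: the softmax derivative is
\partial p_k / \partial z_i = p_k(\delta_{ki} - p_i), so
\partial \mathcal{L}/\partial z_i = -\sum_k y_k (\delta_{ki} - p_i) = -y_i + p_i \sum_k y_k, and since
\sum_k y_k = 1 almost everything cancels, leaving: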
\frac{\partial \mathcal{L}}{\partial z_i} = p_i - y_i.
Step 3 — read it. The gradient is simply prediction minus
target, p - y. If you predicted a class's probability too
high, push its logit down by the excess; too low, push it up by the shortfall. No exponentials,
no logs survive into the gradient — just an error signal. This clean form is half the reason
softmax-with-cross-entropy is the canonical classification head.
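One way to convince yourself of the p - y result is to compare it against a finite-difference estimate of the gradient. The sketch below does that with the standard library only; the function names, the logits, the target, and the step size `eps` are all illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """L = -sum_k y_k log p_k, with p = softmax(logits)."""
    p = softmax(logits)
    return -sum(y * math.log(pk) for y, pk in zip(target, p))

def analytic_grad(logits, target):
    """The closed form from the text: prediction minus target."""
    p = softmax(logits)
    return [pi - yi for pi, yi in zip(p, target)]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        diff = cross_entropy(up, target) - cross_entropy(down, target)
        grad.append(diff / (2 * eps))
    return grad

z = [1.5, -0.3, 0.8]   # hypothetical logits
y = [0.0, 1.0, 0.0]    # one-hot target: the true class is index 1

print(analytic_grad(z, y))
print(numeric_grad(z, y))
```

The two printed vectors should agree to several decimal places. The entry for the true class is negative (its logit should go up) and the other entries are positive (their logits should go down).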
For logits z \in \mathbb{R}^K, \operatorname{softmax}(z)_i = e^{z_i}/\sum_j e^{z_j}:
- A distribution. Every output is positive and they sum to 1.
- Stable form. Subtracting the max, e^{z_i - m}/\sum_j e^{z_j - m}, gives the identical value with no overflow.
- Temperature. \operatorname{softmax}(z/T) sharpens to the argmax as T\to 0 and flattens to uniform as T\to\infty.
- The gradient. With cross-entropy against a one-hot target, \partial \mathcal{L}/\partial z = p - y: prediction minus target.
Watch temperature reshape the distribution
Four fixed logits z = (2,\ 1,\ 0,\ -1) are run through
\operatorname{softmax}(z/T). Cool the temperature toward
0 and the mass piles onto the top logit (sharp, near one-hot); heat
it toward large T and the four probabilities level out toward
1/4 each (uniform). The four always sum to
1, whatever the temperature.
Softmax isn't only the last layer of a classifier: it lives in the middle of every transformer. In
scaled dot-product attention, each query scores every key with a dot product, and those raw scores
are passed through a softmax to become attention weights, a probability distribution over which
values to read. "How much should I attend to each token?" is answered by the very same
e^{z_i}/\sum_j e^{z_j}, temperature and all: dividing the scores by \sqrt{d}, where d is the
dimension of the query and key vectors, is precisely a temperature T = \sqrt{d} that keeps the
logits from saturating the softmax. The function that picks one class is also the function that
decides where a model looks.
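To make the connection concrete, here is a deliberately small sketch of attention weights for a single query, in plain Python. It assumes one query, no masking, no batching, and no learned projections; the function name `attend` and the toy vectors are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query (illustrative sketch)."""
    d = len(query)
    # Raw scores: one dot product per key, divided by sqrt(d) (the temperature).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)  # a probability distribution over the keys
    # Output: the weighted average of the value vectors.
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(dim_v)]
    return weights, output

# Toy data: the query points the same way as the first key.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attend(query, keys, values)
print(weights)
print(output)
```

The key most aligned with the query gets the largest weight, so the output is pulled toward that key's value vector while still blending in the others.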