The Softmax Function

A classifier's final layer spits out a vector of raw real scores — logits — one per class. They can be any size, positive or negative, and they don't add up to anything in particular. To turn them into a usable answer we need a probability distribution: all non-negative, summing to 1. The softmax function is the universal "pick one of many" converter that does exactly this, and it sits at the end of nearly every multiclass classifier — and, as we'll see, inside attention too.

From scores to a distribution

Given logits z = (z_1, \dots, z_K), softmax is:

\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}.

Step 1 — exponentiate. Apply e^{z_i} to each logit. Since e^{x} > 0 for every real x, every entry is now strictly positive — the first requirement of a probability.

Step 2 — normalise. Divide each by the sum of them all. The exponential also amplifies differences: a logit a little larger than the rest becomes a probability a lot larger — softmax is a soft, differentiable version of "take the max".

Step 3 — check it sums to 1. Add up all K outputs; the common denominator is the sum of the numerators, so it cancels:

\sum_{i=1}^{K} \operatorname{softmax}(z)_i = \frac{\sum_{i} e^{z_i}}{\sum_{j} e^{z_j}} = 1.

Positive and summing to one: a genuine probability distribution, by construction.
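The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `softmax` and the example logits are assumptions chosen for the demo, and the code deliberately uses the textbook formula with no overflow protection.

```python
import math

def softmax(logits):
    """Textbook softmax: exponentiate each logit, then divide by the sum."""
    exps = [math.exp(z) for z in logits]   # Step 1: every entry becomes strictly positive
    total = sum(exps)                      # the shared denominator
    return [e / total for e in exps]       # Step 2: normalise

# Illustrative logits (an assumption for this demo, not taken from a real model).
logits = [2.0, 1.0, 0.0, -1.0]
probs = softmax(logits)

print([round(p, 4) for p in probs])   # [0.6439, 0.2369, 0.0871, 0.0321]
print(sum(probs))                     # 1 up to floating-point rounding
```

Note how a gap of 1 between neighbouring logits becomes a ratio of e (about 2.72) between neighbouring probabilities: that is the amplification described in Step 2.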

The numerically stable form

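On a real computer e^{z_i} overflows once a logit gets moderately large: in 64-bit floating point e^{z} exceeds the largest representable number for z above roughly 709 (and above roughly 88 in 32-bit), so e^{1000} evaluates to \infty. The fix is a one-line trick that leaves the value mathematically unchanged.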

Step 1 — subtract a constant from every logit. Let m = \max_j z_j and multiply top and bottom by e^{-m}:

\frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{e^{z_i} e^{-m}}{\sum_j e^{z_j} e^{-m}} = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}}.

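Step 2 — see why it's safe. The common factor e^{-m} cancels, so the result is identical, but now the largest exponent is \max_i (z_i - m) = 0, so every e^{z_i - m} \le 1 and overflow is impossible. The denominator also contains the term e^{0} = 1, so it is at least 1 and can never underflow to zero. This is the form numerical libraries typically compute.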
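A small sketch makes the difference concrete. The function names `softmax_naive` and `softmax_stable` and the test logits are assumptions for illustration. One detail specific to this sketch: Python's `math.exp` raises `OverflowError` on overflow instead of returning infinity, so the failure of the naive form shows up as an exception here.

```python
import math

def softmax_naive(logits):
    """Direct formula: overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    """Max-subtracted formula: same value, no overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]   # every exponent is <= 0
    total = sum(exps)                          # >= 1, since the max contributes exp(0) = 1
    return [e / total for e in exps]

big = [1000.0, 999.0, 998.0]   # illustrative logits, far too large for exp()

try:
    softmax_naive(big)
    naive_overflowed = False
except OverflowError:          # math.exp(1000.0) is out of range for a float
    naive_overflowed = True

stable = softmax_stable(big)
shifted = softmax_stable([2.0, 1.0, 0.0])   # same logits shifted down by 998

print(naive_overflowed)                  # True
print([round(p, 4) for p in stable])     # [0.6652, 0.2447, 0.09]
```

Because only the differences between logits matter, `stable` and `shifted` agree: shifting every logit by the same constant leaves the distribution untouched.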

Temperature: sharpen or flatten

Divide the logits by a temperature T > 0 before the softmax:

\operatorname{softmax}(z/T)_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}.

Step 1 — cool it down, T \to 0. Dividing by a tiny T blows up the gaps between logits, so the largest one runs away and the distribution collapses onto it: softmax sharpens to the argmax, a hard one-hot pick. (If several logits tie for the maximum, the mass splits evenly among them.)

Step 2 — heat it up, T \to \infty. Dividing by a huge T crushes all the logits toward 0, so every e^{z_i/T} \to 1 and the distribution flattens to uniform, 1/K each.

Temperature is the knob between "decisive" and "undecided". At T = 1 you get plain softmax.
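Both limits can be checked by hand in the smallest possible case. The worked example below assumes just two classes (K = 2) purely for illustration, and introduces the symbol \Delta for the gap between the two logits; neither is part of the general setup above. Dividing the numerator and denominator by e^{z_1/T} gives:

```latex
\operatorname{softmax}(z/T)_1
  = \frac{e^{z_1/T}}{e^{z_1/T} + e^{z_2/T}}
  = \frac{1}{1 + e^{-\Delta/T}},
\qquad \Delta = z_1 - z_2 .

% Cooling, with \Delta > 0:  e^{-\Delta/T} \to 0  as  T \to 0^{+}
\lim_{T \to 0^{+}} \operatorname{softmax}(z/T)_1 = 1 .

% Heating:  e^{-\Delta/T} \to e^{0} = 1  as  T \to \infty
\lim_{T \to \infty} \operatorname{softmax}(z/T)_1 = \frac{1}{2} = \frac{1}{K} .
```

With two classes, then, softmax with temperature is a logistic sigmoid of the scaled logit gap \Delta/T: a small T makes the sigmoid steep, a large T makes it flat.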

The gradient that makes training easy

Here is the payoff that makes softmax the default output. Pair it with cross-entropy loss against a one-hot target y, write p = \operatorname{softmax}(z), and the gradient of the loss with respect to the logits is breathtakingly simple.

Step 1 — the combined loss. With a one-hot target y, cross-entropy is \mathcal{L} = -\sum_k y_k \log p_k. Only the true class c has y_c = 1, so the sum reduces to \mathcal{L} = -\log p_c.

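Step 2 — differentiate through softmax. The derivative of softmax is \partial p_k / \partial z_i = p_k(\delta_{ki} - p_i), where \delta_{ki} is 1 if k = i and 0 otherwise. Substituting this into the loss, the 1/p_k from the logarithm cancels the leading p_k, and \sum_k y_k = 1 collapses what remains, leaving: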

\frac{\partial \mathcal{L}}{\partial z_i} = p_i - y_i.

Step 3 — read it. The gradient is simply prediction minus target, p - y. If you predicted a class's probability too high, push its logit down by the excess; too low, push it up by the shortfall. No exponentials, no logs survive into the gradient — just an error signal. This clean form is half the reason softmax-with-cross-entropy is the canonical classification head.
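One way to gain confidence in the p - y result is to compare it against a numerical derivative. The sketch below does that with central finite differences; the helper names (`cross_entropy`, `analytic_grad`, `numeric_grad`), the step size `eps`, and the example logits and target are all assumptions made for this check.

```python
import math

def softmax(logits):
    m = max(logits)                              # max-subtraction for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """L = -sum_k y_k log p_k, with the target given as a one-hot list."""
    p = softmax(logits)
    return -sum(y * math.log(pk) for y, pk in zip(target, p) if y > 0)

def analytic_grad(logits, target):
    """The closed form from the text: dL/dz_i = p_i - y_i."""
    p = softmax(logits)
    return [pi - yi for pi, yi in zip(p, target)]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += eps
        down = list(logits)
        down[i] -= eps
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads

logits = [2.0, 1.0, 0.0, -1.0]   # illustrative logits
target = [0.0, 1.0, 0.0, 0.0]    # one-hot: the true class is index 1

g_analytic = analytic_grad(logits, target)
g_numeric = numeric_grad(logits, target)
max_gap = max(abs(a - n) for a, n in zip(g_analytic, g_numeric))

print([round(g, 4) for g in g_analytic])
print(max_gap)   # tiny: the two gradients agree up to finite-difference error
```

The gradient entry for the true class is negative (its probability is below 1, so its logit gets pushed up), every other entry is positive, and the entries sum to zero because both p and y sum to 1.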


Watch temperature reshape the distribution

Take four fixed logits z = (2,\ 1,\ 0,\ -1) and run them through \operatorname{softmax}(z/T). Cool the temperature toward 0 and the mass piles onto the top logit (sharp, near one-hot); heat it toward large T and the four probabilities level out toward 1/4 each (uniform). The four always sum to 1, whatever the temperature.
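The sweep can be reproduced with a short script that prints a text bar chart for each temperature. This is only a sketch: the function name `softmax_with_temperature`, the particular temperatures tried, and the 40-character bar width are arbitrary choices made for illustration.

```python
import math

def softmax_with_temperature(logits, T):
    """softmax(z / T), computed in the max-subtracted stable form."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0, -1.0]                 # the four fixed logits from the text
temperatures = (0.1, 0.5, 1.0, 2.0, 10.0, 100.0)   # arbitrary sweep, cold to hot

results = {}
for T in temperatures:
    probs = softmax_with_temperature(logits, T)
    results[T] = probs
    print(f"T = {T}")
    for z, p in zip(logits, probs):
        bar = "#" * round(p * 40)              # bar length proportional to probability
        print(f"  z = {z:5.1f}  p = {p:.4f}  {bar}")
```

At the cold end nearly all 40 characters sit on the first bar; at the hot end each bar is about 10 characters long, matching the uniform 1/4.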

Softmax isn't only the last layer of a classifier; it also sits in the middle of every transformer. In scaled dot-product attention, each query scores every key with a dot product, and those raw scores are passed through a softmax to become attention weights: a probability distribution over which values to read. "How much should I attend to each token?" is answered by the very same e^{z_i}/\sum_j e^{z_j}, temperature and all: dividing the scores by \sqrt{d}, where d is the key dimension, is exactly a temperature of T = \sqrt{d}, which keeps the logits from saturating the softmax. The function that picks one class is also the function that decides where a model looks.
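To make the attention connection concrete, here is a toy sketch of scaled dot-product attention for a single query. It is a simplification under stated assumptions: one query, no batching, no masking, no learned projections, and the vectors below are made-up values. The function name `attention` is likewise just a label for this example.

```python
import math

def softmax(scores):
    m = max(scores)                               # max-subtraction for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Single-query scaled dot-product attention on plain Python lists."""
    d = len(query)                                # key/query dimension
    # Raw scores: dot product of the query with each key, divided by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)                     # a distribution over the keys
    # Output: the weighted average of the value vectors.
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return weights, output

# Toy inputs (assumed for illustration only).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attention(query, keys, values)
print([round(w, 4) for w in weights])   # largest weight on the key aligned with the query
print([round(o, 4) for o in output])
```

The key that points in the same direction as the query receives the largest weight, the orthogonal key a middling one, and the opposing key the smallest, while the weights still sum to 1.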