Dropout

A big network can cheat. Given enough capacity it learns brittle little conspiracies — "if neuron 47 fires, neuron 91 should too" — that fit the training data perfectly and generalise terribly. That is overfitting, and dropout is a startlingly cheap cure: on every training step, randomly switch off a fraction of the units, so the network can never lean on any single one being present.

Forced to do its job even when random teammates vanish, each unit has to learn a feature that is useful on its own, not one that only works in cahoots with a specific partner. The result is a powerful, almost free regulariser.

The rule, line by line

Dropout behaves differently in training and at inference. The modern "inverted dropout" formulation does all the bookkeeping at training time so inference stays a plain forward pass.

Step 1 — sample a mask (training). For each unit independently, draw a Bernoulli "keep" coin. With drop probability p, the unit is zeroed with probability p and kept with probability 1-p:

m_j \sim \text{Bernoulli}(1-p), \qquad m_j \in \{0, 1\}.

Step 2 — apply the mask, then rescale the survivors. Multiply each activation a_j by its mask bit, and divide the kept ones by 1-p:

\tilde{a}_j = \frac{m_j}{1-p}\, a_j.
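
Steps 1 and 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not a framework implementation: the function names sample_keep_mask and apply_inverted_dropout, the example activations, and the seeded generator are all assumptions made here for the sake of the example.

```python
import random

def sample_keep_mask(n_units, p_drop, rng):
    """Step 1: one keep bit per unit, 1 with probability 1 - p_drop, else 0."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1) so that 1 - p_drop is non-zero")
    # rng.random() is uniform on [0, 1), so it is >= p_drop with probability 1 - p_drop.
    return [1 if rng.random() >= p_drop else 0 for _ in range(n_units)]

def apply_inverted_dropout(activations, mask, p_drop):
    """Step 2: zero the dropped units and divide the survivors by 1 - p_drop."""
    scale = 1.0 / (1.0 - p_drop)
    return [a * scale if m else 0.0 for m, a in zip(mask, activations)]

rng = random.Random(0)                 # seeded only so the sketch is repeatable
activations = [0.5, -1.2, 2.0, 0.3]    # hypothetical hidden-layer outputs
p_drop = 0.5

mask = sample_keep_mask(len(activations), p_drop, rng)
dropped = apply_inverted_dropout(activations, mask, p_drop)
print(mask)
print(dropped)
```

Every entry of the result is either exactly 0 or the original activation multiplied by 1/(1-p), which is the pattern the formula above describes.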

Step 3 — check the expectation is preserved. Why that 1/(1-p) factor? So that turning units off doesn't quietly shrink the signal. Take the expectation over the random mask for a single unit, using \mathbb{E}[m_j] = 1-p:

\mathbb{E}[\tilde{a}_j] = \frac{\mathbb{E}[m_j]}{1-p}\, a_j = \frac{1-p}{1-p}\, a_j = a_j.

Spelled out the other way: with probability p the activation is 0, and with probability 1-p it is a_j/(1-p), so on average

\mathbb{E}[\tilde{a}_j] = p \cdot 0 + (1-p)\cdot \frac{a_j}{1-p} = a_j.

The expected activation is unchanged. That is the whole point of the rescale: the input each downstream layer receives has the same mean whether or not dropout is active, even though any individual training pass is noisier.
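
The expectation argument is easy to check numerically. The sketch below computes the two-outcome sum directly and then estimates the same mean by simulating many masks for one unit. The values a = 3.0 and p = 0.2, the trial count, and the helper names are assumptions chosen only for illustration.

```python
import random

def expected_after_dropout(a, p_drop):
    """Exact expectation over the two outcomes of a single unit's mask bit."""
    kept_value = a / (1.0 - p_drop)
    return p_drop * 0.0 + (1.0 - p_drop) * kept_value

def simulated_mean(a, p_drop, n_trials, rng):
    """Average of the rescaled activation over many independently sampled masks."""
    kept_value = a / (1.0 - p_drop)
    total = 0.0
    for _ in range(n_trials):
        if rng.random() >= p_drop:     # unit kept with probability 1 - p_drop
            total += kept_value
    return total / n_trials

a, p_drop = 3.0, 0.2
exact = expected_after_dropout(a, p_drop)
estimate = simulated_mean(a, p_drop, 100_000, random.Random(0))
print(exact, estimate)
```

The exact value matches a up to floating-point rounding, and the simulated mean should land close to it. Dropping the division by 1-p in kept_value would instead pull both numbers down towards (1-p) times a.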

Step 4 — inference: use the full network, no scaling. At test time we want a single deterministic prediction, so we keep every unit and apply no mask and no rescale:

\tilde{a}_j = a_j \qquad (\text{inference}).

Because training already rescaled by 1/(1-p), the expected training activation matched the full inference activation all along — so the two phases line up automatically, with no test-time correction to remember.
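
Put together, the two phases differ only by a flag. The following is a sketch under the same assumptions as the rule above; the function name dropout and its training argument are illustrative and are not taken from any particular library.

```python
import random

def dropout(activations, p_drop, training, rng=None):
    """Inverted dropout: mask and rescale in training, identity at inference."""
    if not training:
        return list(activations)       # full network: no mask, no scaling
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p_drop)
    return [a * scale if rng.random() >= p_drop else 0.0 for a in activations]

hidden = [0.5, -1.2, 2.0, 0.3]         # hypothetical hidden-layer outputs
train_out = dropout(hidden, p_drop=0.5, training=True, rng=random.Random(0))
test_out = dropout(hidden, p_drop=0.5, training=False)
print(train_out)
print(test_out)
```

Deep learning frameworks expose the same switch in their own way. In PyTorch, for example, an nn.Dropout layer is active after model.train() and acts as the identity after model.eval().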

Dropout with drop probability p (inverted formulation):

Random mask (training): each unit is independently zeroed with probability p, kept with probability 1-p — a fresh mask every step.
Expectation-preserving scale: surviving activations are multiplied by \dfrac{1}{1-p}, so \mathbb{E}[\tilde{a}_j] = a_j — the average signal is unchanged.
Train vs inference: at inference the full network is used with no mask and no scaling; the training-time rescale makes the two phases match automatically.

There is a deeper way to read dropout. Each training step samples a random mask, which is to say it samples a sub-network: the full architecture with some units deleted. A network of n droppable units has 2^n possible masks, so over training you are, in effect, training an astronomically large ensemble of up to 2^n sub-networks, all sharing one set of weights. Only a tiny fraction of them is ever sampled, but because the weights are shared, every update to one sub-network also moves many of the others.
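
For a toy layer the whole family of sub-networks can be listed explicitly. The sketch below enumerates every mask for n = 4 units and computes the probability of each under independent Bernoulli draws. The helper names and the choice n = 4, p = 0.5 are assumptions for illustration, and the approach is only feasible for very small n.

```python
from itertools import product

def all_masks(n_units):
    """Every keep/drop pattern over n_units droppable units."""
    return list(product([0, 1], repeat=n_units))

def mask_probability(mask, p_drop):
    """Probability that independent Bernoulli draws produce exactly this mask."""
    prob = 1.0
    for bit in mask:
        prob *= (1.0 - p_drop) if bit else p_drop
    return prob

n_units, p_drop = 4, 0.5
masks = all_masks(n_units)
print(len(masks))                      # 2 ** 4 = 16 sub-networks
print(sum(mask_probability(m, p_drop) for m in masks))
```

The count doubles with every extra unit, and the probabilities sum to 1 because each training step picks exactly one mask. The list includes the extreme cases where every unit is kept and where every unit is dropped.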

Ensembles generalise well because their members' errors partly cancel, and dropout buys that benefit at the price of a single model. At inference, using the full unscaled network (the 1/(1-p) factor was already applied during training) acts as a fast approximate average over that exponential family of sub-networks; a "geometric mean of predictions" argument makes this exact for a single softmax layer, while for deeper non-linear networks it holds only approximately. One model, trained once, behaving like an ensemble of billions: that is why so cheap a trick is so effective.
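
The averaging claim can be verified exactly in the simplest possible setting. The sketch below takes a single linear output unit (no non-linearity), enumerates all masks over its three inputs, and computes the probability-weighted arithmetic mean of the masked, rescaled outputs. This is a simpler case than the geometric-mean argument mentioned above, chosen because the agreement follows directly from linearity. The weights, inputs, and function names are hypothetical.

```python
from itertools import product

def linear_output(weights, activations):
    """Output of one linear unit: the dot product of weights and inputs."""
    return sum(w * a for w, a in zip(weights, activations))

def ensemble_average(weights, activations, p_drop):
    """Probability-weighted mean output over all 2^n masked sub-networks."""
    scale = 1.0 / (1.0 - p_drop)
    total = 0.0
    for mask in product([0, 1], repeat=len(activations)):
        prob = 1.0
        for bit in mask:
            prob *= (1.0 - p_drop) if bit else p_drop
        # Inverted dropout: dropped inputs become 0, kept inputs are rescaled.
        masked = [bit * scale * a for bit, a in zip(mask, activations)]
        total += prob * linear_output(weights, masked)
    return total

weights = [0.4, -0.7, 1.5]
activations = [1.0, 2.0, -0.5]
full = linear_output(weights, activations)
averaged = ensemble_average(weights, activations, p_drop=0.3)
print(full, averaged)
```

The two numbers agree up to floating-point rounding. Once a non-linearity sits between the dropped layer and the output, the full-network pass is no longer the exact ensemble mean, which is why the general statement is an approximation.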

Watch units drop out

Interactive figure: a small fully-connected network in which each hidden unit is dimmed with probability p, its connections fading with it, showing the random sub-network this step would train on. Sliding p drops more or fewer units (a typical value is 0.5 for hidden layers), and Refresh resamples a fresh mask, just as the next training step would.