A big network can cheat. Given enough capacity it learns brittle little conspiracies — "if
neuron 47 fires, neuron 91 should too" — that fit the training data perfectly and generalise
terribly. That is
Forced to do its job even when random teammates vanish, each unit has to learn a feature that
is useful on its own, not one that only works in cahoots with a specific partner.
The result is a powerful, almost free regulariser.
Dropout behaves differently in training and at inference. The modern "inverted dropout" formulation does all the bookkeeping at training time so inference stays a plain forward pass.
Step 1 — sample a mask (training). For each unit independently, draw a
Bernoulli "keep" coin. With drop probability p, each unit is kept (mask 1) with probability 1 - p and dropped (mask 0) with probability p.
Step 2 — apply the mask, then rescale the survivors. Multiply each
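activation by its mask value (1 if kept, 0 if dropped), then divide by the keep probability, which is one minus the drop probability. Dropped units output zero; survivors are scaled up to compensate.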
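Step 3 — check the expectation is preserved. Why that rescale? Write a for a unit's activation and p for the drop probability. The masked, rescaled activation then has expectation (1 - p) × a / (1 - p) + p × 0 = a.
Spelled out the other way: with probability 1 - p the unit survives and outputs a / (1 - p); with probability p it is dropped and outputs 0. Weighting the two cases by their probabilities gives back a.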
The expected activation is unchanged. That is the whole point of the rescale: each layer's downstream input keeps the same average size whether or not dropout is active.
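Steps 1 to 3 can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the drop probability `p = 0.3`, the activation `a = 2.0`, and the trial count are illustrative values chosen here, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
p = 0.3                         # drop probability (illustrative value)
a = 2.0                         # one unit's activation (illustrative value)
n_trials = 200_000              # number of independent "keep" coins to flip

# Step 1: Bernoulli keep mask, True with probability 1 - p.
mask = rng.random(n_trials) >= p

# Step 2: zero the dropped trials and rescale the survivors by 1 / (1 - p).
dropped = a * mask / (1.0 - p)

# Step 3: the average over many trials should sit close to the original activation.
empirical_mean = dropped.mean()
exact_mean = (1 - p) * (a / (1 - p)) + p * 0.0

print(f"empirical mean: {empirical_mean:.4f}, exact expectation: {exact_mean:.4f}")
```

Each individual draw is either 0 or a / (1 - p), never a itself; only the average over draws matches the original activation.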
Step 4 — inference: use the full network, no scaling. At test time we want a single deterministic prediction, so we keep every unit and apply no mask and no rescale:
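Because training already rescaled by the reciprocal of the keep probability, the activations at inference already have the expected scale the downstream layers were trained on, so no correction is needed.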
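The four steps fit in one small function. This is a sketch under stated assumptions: the name `inverted_dropout` and its `training` flag are hypothetical choices made for illustration, and real frameworks expose their own dropout layers with their own interfaces.

```python
import numpy as np

def inverted_dropout(x, p, training, rng=None):
    """Apply inverted dropout to array x (hypothetical helper for illustration).

    p is the drop probability; training selects between the masked,
    rescaled training path and the untouched inference path.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("drop probability p must satisfy 0 <= p < 1")
    if not training or p == 0.0:
        # Inference: keep every unit, apply no mask and no rescale.
        return x
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - p
    # Independent Bernoulli keep coin for every unit.
    mask = rng.random(x.shape) < keep_prob
    # Zero the dropped units and scale the survivors by 1 / keep_prob.
    return x * mask / keep_prob

x = np.ones((4, 5))
out_train = inverted_dropout(x, p=0.5, training=True, rng=np.random.default_rng(0))
out_eval = inverted_dropout(x, p=0.5, training=False)
```

With `p=0.5` and an all-ones input, every entry of `out_train` is either 0 or 2, while `out_eval` is the input unchanged.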
There is a deeper way to read dropout. Each training step samples a random mask, which is to
say it samples a sub-network — the full architecture with some units
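deleted. A network of n droppable units has 2^n possible masks, so dropout implicitly trains an exponentially large family of sub-networks that all share one set of weights.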
Ensembles generalise well because their members' errors partly cancel — and dropout buys
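that benefit at the price of a single model. At inference, using the full network with the weights learned under inverted dropout approximates averaging the predictions of all those sub-networks.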
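The ensemble reading can be made concrete on a toy case. The sketch below enumerates every mask for four hidden units feeding a single linear output and compares the average sub-network output with the full network. The sizes, the random values, and `p = 0.5` are all assumptions chosen for illustration. The match is exact here only because the readout is linear and `p = 0.5` makes every mask equally likely; with nonlinear layers after the dropout, the full network is an approximation to the ensemble average, not an identity.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 4                   # small enough to enumerate all 2**4 = 16 masks
p = 0.5                        # drop probability; 0.5 makes all masks equally likely
h = rng.normal(size=n_hidden)  # hidden activations (illustrative values)
w = rng.normal(size=n_hidden)  # weights of a single linear output unit

# Inference-time output: full network, no mask, no rescale.
full_output = h @ w

# Output of every sub-network, using the inverted-dropout rescale.
sub_outputs = []
for bits in itertools.product([0, 1], repeat=n_hidden):
    mask = np.array(bits, dtype=float)
    sub_outputs.append((h * mask / (1.0 - p)) @ w)

# Plain average is the right ensemble mean because each mask has probability 1/16.
ensemble_mean = np.mean(sub_outputs)

print(len(sub_outputs), np.isclose(ensemble_mean, full_output))
```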
A small fully-connected network: each hidden unit is dimmed with probability