Regularization
Regularization fights overfitting by adding a penalty for complexity to the cost.
Instead of only minimizing the error, the model minimizes
J = \underbrace{\text{error}}_{\text{fit the data}} + \lambda\underbrace{\lVert\vec{w}\rVert^2}_{\text{keep weights small}}.
The extra term is the squared length of the weight vector, scaled by a strength \lambda \ge 0. It nudges every
weight toward zero, so the model can only use a big weight if the data really earns it, that is, if the drop in error outweighs the added penalty. Small
weights mean a smoother, calmer function and, usually, less overfitting.
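To make the cost concrete, here is a minimal sketch in Python with NumPy. It assumes a linear model and takes "error" to be the sum of squared residuals; neither choice is fixed by the text above. The names `ridge_cost` and `ridge_fit`, and the synthetic data, are made up for illustration.

```python
import numpy as np

def ridge_cost(w, X, y, lam):
    """J = (sum of squared residuals) + lam * ||w||^2."""
    residual = X @ w - y
    return np.sum(residual ** 2) + lam * np.sum(w ** 2)

def ridge_fit(X, y, lam):
    """Minimize ridge_cost in closed form: solve (X^T X + lam I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (illustrative assumption): 30 samples, 5 features, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=30)

w_plain = ridge_fit(X, y, lam=0.0)   # no penalty: ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)  # penalty pulls the weights toward zero

print("||w|| without penalty:", np.linalg.norm(w_plain))
print("||w|| with lam = 10:  ", np.linalg.norm(w_ridge))
```

The penalized weight vector comes out shorter than the unpenalized one, which is the "nudge toward zero" at work.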
Turn up the penalty
Start with a wild overfit curve at \lambda = 0. Increase
\lambda and watch the penalty tame it: the wiggles flatten out and the
curve relaxes toward a smooth trend. Too much, though, and it over-smooths into an underfit. The
penalty strength is itself a bias–variance dial: a small \lambda leaves the variance high, and a large \lambda trades it for bias.
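The same experiment can be sketched offline. The example below assumes a degree-9 polynomial fitted to 12 noisy samples of a sine curve, with the squared-error cost and a penalty on all coefficients including the constant term. The data, the degree, the \lambda values, and the helper name `fit_ridge_poly` are all illustrative choices.

```python
import numpy as np

def fit_ridge_poly(x, y, degree, lam):
    """Ridge-fit a polynomial via least squares on an augmented system.

    Stacking sqrt(lam) * I under the design matrix makes ordinary least
    squares minimize ||X w - y||^2 + lam * ||w||^2.
    """
    X = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    n_features = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(n_features)])
    y_aug = np.concatenate([y, np.zeros(n_features)])
    w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return w, X

# Noisy samples of a smooth trend (illustrative assumption).
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(3.0 * x) + 0.2 * rng.normal(size=x.size)

results = []
for lam in [0.0, 1e-3, 1e-1, 10.0]:
    w, X = fit_ridge_poly(x, y, degree=9, lam=lam)
    train_error = np.sum((X @ w - y) ** 2)
    results.append((lam, np.linalg.norm(w), train_error))
    print(f"lam={lam:<6g} ||w||={np.linalg.norm(w):10.3f} train error={train_error:.4f}")
```

As \lambda grows, the weight norm falls while the training error rises. A falling training error is therefore not the goal here; what matters is the error on data the model has not seen.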
The flavours
Penalizing the squared length (the squared L_2 norm) is called ridge
regression; it shrinks all weights smoothly but rarely makes any of them exactly zero. Penalizing the sum of absolute values (the
L_1 norm) is lasso, which drives some weights exactly
to zero and so performs automatic feature selection. Either way, \lambda is a
bias–variance knob you tune on held-out data. Regularization is one of the most reliable tools for making models
generalize.
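The difference between the two penalties is easiest to see in a special case. The sketch below assumes a squared-error cost and a design matrix with orthonormal columns, so each weight can be solved for on its own. Under those assumptions, if a is the unpenalized least-squares weight, ridge gives a / (1 + \lambda) and lasso gives a soft-thresholded value with threshold \lambda / 2. The function names and example numbers are illustrative.

```python
import numpy as np

def ridge_shrink(a, lam):
    """Minimizer of (w - a)^2 + lam * w^2: scale every weight down."""
    return a / (1.0 + lam)

def lasso_shrink(a, lam):
    """Minimizer of (w - a)^2 + lam * |w|: soft-threshold at lam / 2."""
    return np.sign(a) * np.maximum(np.abs(a) - lam / 2.0, 0.0)

# Unpenalized weights (illustrative values): two large, two small.
a = np.array([3.0, -0.4, 0.2, -2.0])
lam = 1.0

w_ridge = ridge_shrink(a, lam)
w_lasso = lasso_shrink(a, lam)

print("ridge:", w_ridge)  # every weight is smaller, none is exactly zero
print("lasso:", w_lasso)  # weights with |a| <= lam / 2 become exactly zero
print("features kept by lasso:", np.count_nonzero(w_lasso))
```

Ridge rescales every weight by the same factor, so small weights stay small but nonzero. Lasso subtracts a fixed amount and clips at zero, which is why weak features drop out entirely.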