Regularization

Regularization fights overfitting by adding a penalty for complexity to the cost. Instead of only minimizing the error, the model minimizes

J = \underbrace{\text{error}}_{\text{fit the data}} + \lambda\underbrace{\lVert\vec{w}\rVert^2}_{\text{keep weights small}}.

The extra term is the squared length of the weight vector, scaled by a strength \lambda \ge 0. It nudges every weight toward zero, so a weight can stay large only if the drop in error it buys outweighs the penalty it costs. Small weights mean a smoother, less wiggly function, and so less overfitting.
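To make the trade-off concrete, here is a minimal sketch of the cost J, assuming the error term is mean squared error on a linear model. The function name `regularized_cost` and the toy data are illustrative choices, not something fixed by the formula above. The two features are deliberately identical, so both weight vectors fit the data perfectly and only the penalty tells them apart.

```python
import numpy as np

def regularized_cost(w, X, y, lam):
    """Mean squared error plus lam times the squared length of w."""
    residuals = X @ w - y
    error = np.mean(residuals ** 2)      # "fit the data"
    penalty = lam * np.dot(w, w)         # "keep weights small"
    return error + penalty

# Toy data (an assumption for illustration): two identical features.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

w_small = np.array([0.5, 0.5])    # fits exactly with small weights
w_big = np.array([10.0, -9.0])    # fits exactly with large, cancelling weights

cost_small = regularized_cost(w_small, X, y, lam=0.1)
cost_big = regularized_cost(w_big, X, y, lam=0.1)
print(cost_small, cost_big)
```

With \lambda = 0 the two solutions would tie at zero cost. Any positive \lambda breaks the tie in favour of the small weights.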

Turn up the penalty

Start with a wild overfit curve at \lambda = 0. As \lambda increases, the penalty tames it: the wiggles flatten out and the curve relaxes toward a smooth trend. Push \lambda too high, though, and the curve over-smooths into an underfit. The penalty strength is itself a bias–variance dial.
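The same experiment can be sketched in a few lines, under some assumptions: a degree-9 polynomial fitted to 12 noisy samples of a sine wave, with the error measured as the sum (not the mean) of squared residuals, so that the minimizer has the closed form w = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T y. The names `ridge_fit` and `Phi` and the data are hypothetical, and for simplicity the constant term is penalized along with the other weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: noisy samples of one period of a sine wave.
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.size)

# Degree-9 polynomial features: columns 1, x, x^2, ..., x^9.
Phi = np.vander(x, 10, increasing=True)

def ridge_fit(Phi, y, lam):
    """Minimize ||Phi w - y||^2 + lam * ||w||^2 in closed form."""
    n_features = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(n_features)
    return np.linalg.solve(A, Phi.T @ y)

lams = [0.0, 1e-3, 1e-1, 10.0]
for lam in lams:
    w = ridge_fit(Phi, y, lam)
    train_error = np.sum((Phi @ w - y) ** 2)
    print(f"lambda={lam:g}  |w|={np.linalg.norm(w):.3f}  train error={train_error:.4f}")
```

As \lambda grows, the weight norm falls and the training error rises. The training error alone cannot say which \lambda is best; that takes held-out data.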

The flavours

Penalizing the squared length (the squared L_2 norm) is called ridge regression in the linear-regression setting; it shrinks all weights smoothly but rarely makes any of them exactly zero. Penalizing the sum of absolute values (the L_1 norm) is lasso, which drives some weights exactly to zero, giving automatic feature selection. Either way, \lambda is a bias–variance knob you tune on held-out data. Regularization is one of the most reliable tools for making models generalize.
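The difference between the two penalties is easiest to see in a simplified setting, assumed here purely for illustration: each weight is handled on its own, with z standing for the value it would take without any penalty. Minimizing (w - z)^2 + \lambda w^2 gives w = z / (1 + \lambda), while minimizing (w - z)^2 + \lambda |w| gives the soft-threshold rule w = sign(z) max(|z| - \lambda/2, 0). The function names below are hypothetical.

```python
import numpy as np

def ridge_shrink(z, lam):
    """Minimizer of (w - z)^2 + lam * w^2: scale every weight down."""
    return z / (1.0 + lam)

def lasso_shrink(z, lam):
    """Minimizer of (w - z)^2 + lam * |w|: soft-threshold at lam / 2."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

# Assumed unpenalized weights: two large, two small.
z = np.array([3.0, -0.4, 0.1, -2.0])
lam = 1.0

w_ridge = ridge_shrink(z, lam)
w_lasso = lasso_shrink(z, lam)
print("ridge:", w_ridge)
print("lasso:", w_lasso)
print("nonzero weights:", np.count_nonzero(w_ridge), "vs", np.count_nonzero(w_lasso))
```

Ridge halves every weight but keeps all four in play. Lasso subtracts a fixed amount from each magnitude, so the two small weights land exactly on zero and drop out of the model.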