Cross-Entropy Loss

Squared error is a poor fit for classification. We need a loss that grades a probability: gentle when the model is confidently right, brutal when it's confidently wrong. That loss is cross-entropy (also called log loss). For a single example with true label y \in \{0, 1\} and predicted probability p (the model's probability that y = 1):

L = -\big[\,y\log p + (1 - y)\log(1 - p)\,\big].

Because y is either 0 or 1, only one of the two terms survives for any given example. If the truth is y = 1, the loss is -\log p; if y = 0, it's -\log(1 - p). Either way, predicting the right class with high confidence gives almost zero loss, and being confidently wrong sends it soaring.
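The single-example formula can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function name binary_cross_entropy is made up for this example, and it assumes p lies strictly between 0 and 1, since the logarithm is undefined at the endpoints.

```python
import math

def binary_cross_entropy(y, p):
    """Loss for one example.

    y: true label, 0 or 1.
    p: predicted probability that y = 1; assumed to satisfy 0 < p < 1.
    """
    # For y = 1 the second term vanishes; for y = 0 the first term vanishes.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Confidently right: loss close to zero.
print(binary_cross_entropy(1, 0.99))  # about 0.01
# Confidently wrong: loss is large.
print(binary_cross_entropy(1, 0.01))  # about 4.61
```

The two cases are mirror images: predicting 0.01 when the truth is y = 0 costs the same as predicting 0.99 when the truth is y = 1.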

The cost of being wrong

Fix the true label and follow the loss as the predicted probability slides from 0 to 1. As the prediction approaches the correct end, the loss falls to zero; as it approaches the wrong end, the loss grows without bound. That steep punishment is what drives the model to be both correct and honest about its confidence.
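To see the shape of that curve in numbers, the sketch below evaluates the loss at a handful of probabilities for a fixed true label. The helper name loss_curve and the chosen grid of probabilities are illustrative assumptions; the endpoints 0 and 1 are left out on purpose because the loss is infinite there.

```python
import math

def loss_curve(y, probabilities):
    """Return (p, loss) pairs for a fixed true label y (0 or 1)."""
    curve = []
    for p in probabilities:
        # Only the term matching the true label contributes.
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        curve.append((p, loss))
    return curve

# Illustrative grid, strictly inside (0, 1).
probs = [0.001, 0.01, 0.1, 0.5, 0.9, 0.99, 0.999]

for p, loss in loss_curve(1, probs):
    print(f"p = {p:<6} loss = {loss:.4f}")
```

With y = 1, the loss drops from roughly 6.91 at p = 0.001 to roughly 0.001 at p = 0.999. Each factor of ten closer to the wrong end adds about 2.3 to the loss, which is why it has no upper limit.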

Why infinity, on purpose

A model that says "100% spam" about a legitimate email should be punished without mercy. Cross-entropy's blow-up to infinity does exactly that, pushing the model away from total certainty unless the data truly justify it. Averaged over the dataset and minimised by gradient descent, cross-entropy is the loss that trains logistic regression and the output layer of nearly every classification neural network.
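As a rough sketch of "averaged over the dataset and minimised by gradient descent", the code below fits a one-feature logistic regression in plain Python. Several things here are assumptions made for illustration: the toy data, the function names, the learning rate and step count, and the eps clipping, which is a common practical guard that keeps the computed loss finite when a prediction reaches exactly 0 or 1. The gradient uses the standard result that, for a sigmoid output, the derivative of the mean loss with respect to the weight is the average of (p - y) x.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_cross_entropy(ys, ps, eps=1e-12):
    """Average cross-entropy; p is clipped to [eps, 1 - eps] (an assumption)."""
    total = 0.0
    for y, p in zip(ys, ps):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(ys)

def train_logistic_1d(xs, ys, lr=0.1, steps=500):
    """Gradient descent on the mean cross-entropy of p = sigmoid(w * x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        ps = [sigmoid(w * x + b) for x in xs]
        # Gradients of the mean loss with respect to w and b.
        grad_w = sum((p - y) * x for p, y, x in zip(ps, ys, xs)) / n
        grad_b = sum(p - y for p, y in zip(ps, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: larger x tends to mean class 1, with some overlap.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

# With w = b = 0 every prediction is 0.5, so the loss starts at log 2.
initial_loss = mean_cross_entropy(ys, [sigmoid(0.0)] * len(xs))
w, b = train_logistic_1d(xs, ys)
final_loss = mean_cross_entropy(ys, [sigmoid(w * x + b) for x in xs])
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Because the two classes overlap in this toy data, no setting of w and b can be confidently right everywhere, so the fitted probabilities stay away from 0 and 1. That is the "honest about its confidence" behaviour the loss is designed to encourage.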