Squared error is wrong for classification. We need a loss that grades a probability:
gentle when the model is confidently right, brutal when it's confidently wrong. That loss is
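cross-entropy (also called log loss). For a single example with true label $y \in \{0, 1\}$ and predicted probability $p$ of class 1, the loss is

$$L = -\bigl[\, y \log p + (1 - y) \log(1 - p) \,\bigr]$$

Only one term survives each time. If the truth is $y = 1$, the loss is $-\log p$; if the truth is $y = 0$, it is $-\log(1 - p)$.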
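
To make the "only one term survives" point concrete, here is a minimal Python sketch of the standard binary cross-entropy for a single example. The function name `binary_cross_entropy` and the symbols `y` (true label, 0 or 1) and `p` (predicted probability of class 1) are illustrative choices, not names taken from the text.

```python
import math

def binary_cross_entropy(y: int, p: float) -> float:
    """Log loss for one example.

    y is the true label (0 or 1); p is the predicted probability of class 1.
    p must lie strictly between 0 and 1, otherwise math.log raises ValueError.
    """
    # Because y is 0 or 1, exactly one of the two terms is multiplied by zero.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# True label 1: only the -log(p) term survives.
loss_pos = binary_cross_entropy(1, 0.9)
# True label 0: only the -log(1 - p) term survives.
loss_neg = binary_cross_entropy(0, 0.9)

print(f"y=1, p=0.9 -> {loss_pos:.4f}")  # 0.1054
print(f"y=0, p=0.9 -> {loss_neg:.4f}")  # 2.3026
```

The same prediction of 0.9 costs about 0.11 when the label is 1 and about 2.30 when the label is 0, which is the "gentle when right, brutal when wrong" behaviour described above.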
Pick the true label and watch the loss as the predicted probability slides from 0 to 1.
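
The sweep can also be reproduced in a few lines of Python. This is a sketch under assumed names (`loss_for_true_label`, `true_label`, `probs`), with a hand-picked grid of probabilities that stays away from exactly 0 and 1, where the logarithm is undefined.

```python
import math

def loss_for_true_label(y: int, p: float) -> float:
    """Cross-entropy for one example; p is the predicted probability of class 1."""
    # Keep only the term that matches the true label.
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

true_label = 1  # set to 0 to see the mirrored curve
probs = [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]
losses = [loss_for_true_label(true_label, p) for p in probs]

for p, loss in zip(probs, losses):
    print(f"p = {p:<5} loss = {loss:.3f}")
```

With `true_label = 1`, the loss falls from roughly 4.6 at `p = 0.01` to roughly 0.01 at `p = 0.99`, and it keeps growing without bound as `p` moves closer to 0.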
A model that says "100% spam" about a legitimate email should be punished without mercy, and
cross-entropy's blow-up to infinity as the probability given to the truth nears zero does exactly that, forcing the model to never be totally
certain unless it truly is. Averaged over the dataset and minimised by