Train, Validation, Test

We already split data into train and test. But there's a subtlety. While building a model you make dozens of choices: which features, how deep the tree, how strong the regularization. If you tune those by checking the test set, you've quietly let it leak into the model, and your final score is a lie. So we use three splits: a training set to fit the model, a validation set to compare and tune those choices, and a test set you score only once, at the very end.

Split the data three ways

How much data goes to training is a trade-off; whatever is left is shared between validation and test. A typical split is something like 60/20/20. More training data generally means a better-fitted model, but you still need enough held back to tune on (validation) and to judge honestly (test).
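As a rough illustration, here is a minimal sketch of a 60/20/20 split in plain Python. The function name three_way_split and its parameters are hypothetical, chosen for this example; it assumes the examples are independent, so a simple shuffle is a fair way to divide them.

```python
import random

def three_way_split(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle items, then cut them into train, validation, and test lists.

    Hypothetical helper for illustration: the test set is whatever is
    left after the train and validation fractions are taken.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before cutting matters when the data arrives in some order (by date, by class, by source); for time-ordered data a shuffle would itself leak the future into training, so this sketch would not apply as written.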

Cross-validation, when data is scarce

With little data, a single validation set is wasteful and noisy: it locks away examples you could have trained on, and its score depends heavily on which examples happened to land in it. k-fold cross-validation fixes this: split the training data into k folds, then train k times, each time validating on a different fold and training on the rest. Averaging the k scores gives a far steadier estimate, and every example gets used for both training and validation, just never at the same time. The test set still stays sealed until the end.
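The procedure can be sketched in a few lines of plain Python. Everything named here (k_fold_splits, cross_validate, fit_mean, mean_abs_error) is a hypothetical helper written for this example, and the "model" is deliberately trivial: it always predicts the mean of the training targets. The xs and ys lists stand in for the training portion only, with the test set assumed to be already set aside.

```python
import random
from statistics import mean

def k_fold_splits(n_items, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each index is validated on exactly once."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i, val_idx in enumerate(folds):
        # Train on every fold except the one held out for validation.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def cross_validate(xs, ys, fit, score, k=5, seed=0):
    """Train k times and return (average score, list of per-fold scores)."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(xs), k, seed):
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(score(model, [xs[j] for j in val_idx], [ys[j] for j in val_idx]))
    return mean(scores), scores

# Toy model (an assumption for illustration): predict the mean training target.
def fit_mean(train_x, train_y):
    return mean(train_y)

def mean_abs_error(model, val_x, val_y):
    return mean(abs(y - model) for y in val_y)

xs = list(range(20))          # stands in for the training data only
ys = [2 * x + 1 for x in xs]
avg, per_fold = cross_validate(xs, ys, fit_mean, mean_abs_error, k=5)
print("per-fold error:", [round(s, 2) for s in per_fold])
print("average error:", round(avg, 2))
```

The per-fold scores usually differ from one another, which is the noise a single validation set would have hidden; the average is the number to compare across model choices.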