Train, Validation, Test
We already split data into train and test sets, but there's a subtlety. While building a model you make dozens of choices: which features to use, how deep the tree should be, how strong the regularization is. If you tune those choices by checking the test set, information about the test set quietly leaks into the model, and your final score is a lie: it comes out more optimistic than what you'd see on truly new data. So we use three splits:
- Training set: fits the model's parameters.
- Validation set: tunes the choices (hyperparameters); you may peek at it many times.
- Test set: touched once, at the very end, for an honest final score.
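To make the three roles concrete, here is a minimal sketch of the workflow in plain Python. Everything in it is an assumption made for illustration: the toy data, the tiny 1-D k-nearest-neighbours classifier, and the names `knn_predict`, `accuracy`, and `candidates` are hypothetical, not part of any particular library.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, data, k):
    """Fraction of `data` that k-NN, using `train` as its memory, labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)

# Hypothetical toy data: label is 1 when x > 0.5, with about 15% of labels flipped.
rng = random.Random(0)
points = []
for _ in range(300):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.15:
        y = 1 - y
    points.append((x, y))

# 60/20/20 split; the points were generated in random order, so slicing is enough here.
train, val, test = points[:180], points[180:240], points[240:]

# Peek at the validation set as often as needed to choose the hyperparameter k.
candidates = [1, 3, 5, 9, 15, 25]  # odd values, so votes never tie
val_scores = {k: accuracy(train, val, k) for k in candidates}
best_k = max(candidates, key=lambda k: val_scores[k])

# Touch the test set exactly once, with the chosen k already fixed.
test_score = accuracy(train, test, best_k)
print(f"best k = {best_k}, validation = {val_scores[best_k]:.2f}, test = {test_score:.2f}")
```

The validation score of the winning `k` tends to flatter the model a little, because `k` was picked for doing well on exactly those examples. The test score has no such bias, which is the reason for keeping it sealed.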
Split the data three ways
The more data goes to training, the less is left to share between validation and test. A typical split is something like 60/20/20. More training data usually means a better-fitted model, but you still need enough held back to tune and to judge honestly.
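A three-way split like this can be sketched with nothing but the standard library. The function name `three_way_split` and its parameters are hypothetical, chosen for this example; libraries such as scikit-learn offer their own splitting helpers.

```python
import random

def three_way_split(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of `items` and cut it into train, validation, and test lists."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains, including rounding leftovers
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Shuffling before cutting matters when the data arrives in some order (by date, by class, by source); without it, the three sets could end up looking quite different from one another. For time-ordered data a shuffle may be the wrong choice, so treat this as a sketch for independent examples only.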
Cross-validation, when data is scarce
With little data, a single validation set is wasteful and noisy: it locks away examples the model could have learned from, and a score computed on so few examples swings with whichever ones happened to land in it. k-fold cross-validation fixes this: split the training data into k folds, then train k times, each time validating on a different fold and training on the rest. Averaging the k scores gives a far steadier estimate, and every example gets used for both training and validation, just never at the same time.
The test set still stays sealed until the end.
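The fold rotation can be sketched in a few lines of plain Python. The helper names (`k_fold_indices`, `cross_val_score`, `fit_mean`, `neg_mse`) and the toy "predict the mean" model are assumptions made for illustration; `xs` and `ys` stand for the non-test portion of the data only.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each index lands in validation exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i, val_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, val_idx

def cross_val_score(xs, ys, fit, score, k=5):
    """Train k times, validate on a different fold each time, and average the scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(score(model, [xs[j] for j in val_idx], [ys[j] for j in val_idx]))
    return mean(scores), scores

# Toy model for illustration: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return mean(ys)

def neg_mse(model, xs, ys):
    """Negative mean squared error, so that higher is better."""
    return -mean((y - model) ** 2 for y in ys)

xs = list(range(20))
ys = [2 * x + 1 for x in xs]
avg, per_fold = cross_val_score(xs, ys, fit_mean, neg_mse, k=5)
print(f"mean score over {len(per_fold)} folds: {avg:.2f}")
```

To tune a hyperparameter this way, you would run `cross_val_score` once per candidate setting and keep the setting with the best average, then retrain on all the non-test data before the single final check on the test set.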