Entropy and Information Gain

Which question should a decision tree ask first? The one that best tidies up the classes. To measure tidiness we use entropy, the amount of disorder in a group. A group that is all one class is perfectly pure (entropy 0); a 50/50 mix of two classes is maximal chaos (entropy 1, measured in bits because the logarithm is base 2). For two classes with proportions p and 1-p:

H(p) = -p\log_2 p - (1-p)\log_2(1-p).
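To make the formula concrete, here is a minimal Python sketch of it. The function name `binary_entropy` is our own choice for illustration, and the sketch assumes the usual convention that 0 log 0 counts as 0, so that a pure group gets entropy 0.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-class group with proportions p and 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # Assumed convention: 0 * log2(0) is treated as 0, so a pure group scores 0.
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Tabulate the entropy for a few class mixes.
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:.2f}  H = {binary_entropy(p):.3f}")
```

Running it shows the value rising from 0 at p = 0 to 1 at p = 0.5 and falling back to 0 at p = 1, with matching values for p and 1-p.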

The disorder curve

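Vary the class mix from all one class to all the other. Entropy is highest at a perfect 50/50 split, where uncertainty is total, and falls to zero as the group becomes all one class (taking 0 log 0 to be 0). The curve is symmetric: a 90/10 mix is exactly as disordered as a 10/90 mix. A good split is one that produces pure, low-entropy children.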

Information gain picks the split

Information gain is how much entropy a split removes: the parent's entropy minus the weighted average entropy of the children it creates, with each child weighted by the fraction of the parent's samples it receives. The tree greedily chooses, at each node, the cut with the highest information gain, that is, the question that most reduces disorder. Repeat on each child, and a tree grows itself. (Some trees use a closely related measure called Gini impurity; the spirit is the same.) Next we'll see how this greediness, left unchecked, leads straight to overfitting.
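The gain calculation and the greedy choice can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a full tree learner: the names `entropy`, `information_gain`, and `candidate_splits` are hypothetical, the toy labels are made up, and each child's entropy is weighted by its share of the parent's samples.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Entropy in bits of the class labels in one group."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = (count / n for count in Counter(labels).values())
    return sum(-p * math.log2(p) for p in proportions)


def information_gain(parent: Sequence[Hashable],
                     children: Sequence[Sequence[Hashable]]) -> float:
    """Parent entropy minus the size-weighted entropy of the children.

    Assumes the children together contain exactly the parent's samples.
    """
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy node (made-up data): four samples of class A and four of class B.
parent = ["A"] * 4 + ["B"] * 4

# Three hypothetical questions, each splitting the node into two children.
candidate_splits = {
    "perfect": [["A"] * 4, ["B"] * 4],
    "partial": [["A", "A", "A", "B"], ["A", "B", "B", "B"]],
    "useless": [["A", "A", "B", "B"], ["A", "A", "B", "B"]],
}

for name, children in candidate_splits.items():
    print(f"{name}: gain = {information_gain(parent, children):.3f}")

# Greedy step: keep the question with the highest information gain.
best_name = max(candidate_splits,
                key=lambda name: information_gain(parent, candidate_splits[name]))
print("chosen split:", best_name)
```

The perfect split removes the full 1 bit of entropy, the useless split removes none, and the partial split falls in between, so the greedy step picks the perfect one. A real tree would apply the same step again inside each child.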