1. Information and feature importance
Welcome back. Now that we have an intuition for the information found in individual dimensions, let's discuss feature importance.
2. Information gain
Information gain is sometimes treated as a synonym for feature importance, and it is also a specific measure of it. The authors of Data Science for Business keenly observe that "when the world presents you with very large sets of features, it may be (extremely) useful to hearken back to...the idea of information gain and to select a subset of informative attributes. Doing so can substantially reduce the size of an unwieldy dataset, and...often will improve the accuracy of the resultant model."
3. Feature importance
Feature importance can be thought of as a measure of information in model building. Specifically, a predictor's importance depends on how much information it provides about the target variable. Predictors and target variables are features with specific roles.
While there are many measures of feature importance — like correlation with the target variable and standardized regression coefficients — we'll focus on information gain, which measures feature importance in decision trees.
4. Decision tree example
Imagine that we want to predict whether an applicant will default on a loan or not. We have sixteen observations. Y indicates a default and N indicates no default. To learn about information gain, we'll make our predictors visual: shape, color, outline, and texture.
5. Decision tree and information gain
Which feature will provide us with the most information about the target variable?
To determine this, we calculate the information gain of each of those four predictors: shape, color, outline, and texture. Information gain is the amount of information we learn about one variable by observing another variable. In this example, it is measured by the difference between the entropy of the parent node and that of the child nodes after the observations have been grouped by a predictor.
6. Entropy
Entropy is a measure of disorder. By disorder, we mean a lack of purity. Entropy ranges from 0 to 1. In the graph, notice how the sets on the left and right are purely squares and circles, respectively. They have zero entropy. The middle set, with half squares and half circles, has the maximum entropy of 1.
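As a quick check of those two endpoints, here is a minimal Python sketch (my illustration, not part of the course materials) that uses SciPy's entropy function with base two.

```python
# Minimal sketch: verify the entropy endpoints for a two-class set.
from scipy.stats import entropy

pure = [1.0, 0.0]        # all one class, e.g. all squares
half_half = [0.5, 0.5]   # half squares, half circles

print(entropy(pure, base=2))       # 0.0 -> perfectly pure, no disorder
print(entropy(half_half, base=2))  # 1.0 -> maximum disorder for two classes
```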
7. Entropy: root node
To calculate entropy, we sum the negative products of each class probability and the log base two of that probability. Here we have two classes: yes and no.
For the yes class, there are seven yeses out of sixteen observations. Similarly, we calculate the probability of the no observations — nine out of sixteen.
Then we plug these class probabilities into the entropy equation. The entropy of the root node is zero point nine eight nine, which is close to the maximum.
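To make that arithmetic concrete, here is a small Python sketch (my illustration, not course code) of the root-node calculation just described.

```python
from math import log2

# Entropy of a two-class node: the sum of -p * log2(p) over both classes.
def binary_entropy(p_yes, p_no):
    return -(p_yes * log2(p_yes) + p_no * log2(p_no))

p_yes = 7 / 16   # seven defaults (Y) out of sixteen observations
p_no = 9 / 16    # nine non-defaults (N) out of sixteen

print(round(binary_entropy(p_yes, p_no), 3))  # 0.989
```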
Could we improve the entropy by splitting the observations by shape?
8. Entropy: children nodes
Here we've split the observations based on shape. Let's calculate the entropy of each child node. On the left, the class probabilities are two-ninths and seven-ninths, and the entropy is
9. Entropy: children nodes
zero point seven six four.
10. Entropy: children nodes
On the right, the probabilities are five-sevenths and two-sevenths
11. Entropy: children nodes
and the entropy is zero point eight six three.
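The same helper applies to the two child nodes on these slides; the sketch below is again my illustration, assuming nine circles in the left node and seven squares in the right.

```python
from math import log2

def binary_entropy(p_a, p_b):
    return -(p_a * log2(p_a) + p_b * log2(p_b))

# Left child: nine circles, class probabilities 2/9 and 7/9
left_entropy = binary_entropy(2 / 9, 7 / 9)
print(round(left_entropy, 3))   # 0.764

# Right child: seven squares, class probabilities 5/7 and 2/7
right_entropy = binary_entropy(5 / 7, 2 / 7)
print(round(right_entropy, 3))  # 0.863
```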
12. Information gain: root to children
The information gain of shape is the root entropy minus the weighted average of the child entropies.
Nine of the sixteen observations are circles and seven are squares. We use these proportions for the weighted average, which we subtract from the root entropy. Shape provides an information gain of zero point one eight one.
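Continuing the illustrative sketch from the previous slides, the whole information-gain calculation for shape can be written in a few lines.

```python
from math import log2

def binary_entropy(p_a, p_b):
    return -(p_a * log2(p_a) + p_b * log2(p_b))

root_entropy = binary_entropy(7 / 16, 9 / 16)   # ~0.989
left_entropy = binary_entropy(2 / 9, 7 / 9)     # ~0.764, nine circles
right_entropy = binary_entropy(5 / 7, 2 / 7)    # ~0.863, seven squares

# Weight each child's entropy by its share of the sixteen observations.
weighted_children = (9 / 16) * left_entropy + (7 / 16) * right_entropy

print(round(root_entropy - weighted_children, 3))  # 0.181
```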
13. Compare information gain across features
To determine the most informative feature, we calculate the information gain of all the features and discover that shape provides the most information about loan default.
Remember that information gain is only one way to measure feature importance. As we'll see later, we can extract feature importances from fitted models.
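As a preview of that, here is a hedged sketch of how scikit-learn exposes importances from a fitted decision tree. The data below is a small synthetic stand-in (not the sixteen loan observations), and the column names are chosen only to mirror the example.

```python
# Illustrative sketch: feature importances from a fitted decision tree.
# The data is synthetic; only the first predictor is tied to the target.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4))                      # four binary predictors
y = np.where(rng.random(200) < 0.8, X[:, 0], 1 - X[:, 0])  # target mostly follows column 0

feature_names = ["shape", "color", "outline", "texture"]
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Importances are normalized to sum to one; the first predictor should dominate here.
for name, importance in zip(feature_names, tree.feature_importances_):
    print(name, round(importance, 3))
```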
14. Let's practice!
Let's practice calculating entropy and information gain.