1. Features with missing values or little variance
In this lesson, we'll automate the selection of features that have sufficient variance and not too many missing values.
2. Creating a feature selector
Let's once again use the ANSUR dataset
which has 6068 rows and 94 numeric columns, as an example. As we saw in chapter one, low-variance features are so similar across observations that they may contain little information you can use in an analysis.
To remove them, we can use one of Scikit-learn's built-in feature selection tools called VarianceThreshold().
When we create the selector, we can set the minimum variance threshold. Here, we've set it to a variance of one.
After fitting the selector to our dataset,
its .get_support() method gives us a True or False value for each feature, indicating whether its variance is above the threshold. We call this type of boolean array a mask.
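A minimal sketch of these steps, assuming the data is loaded in a DataFrame called ansur_df (a name assumed here, not given in the lesson):

```python
from sklearn.feature_selection import VarianceThreshold

# Create the selector with a minimum variance threshold of one
sel = VarianceThreshold(threshold=1)

# Fit the selector to the (numeric) dataset
sel.fit(ansur_df)

# Boolean mask: True where a feature's variance is above the threshold
mask = sel.get_support()
```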
3. Applying a feature selector
We can use this mask to reduce the number of dimensions in our dataset.
We use the DataFrame's .loc[] indexer, specifying that we want all rows with a colon as the first argument and sub-selecting the columns with our mask as the second.
In this case, our selector has reduced the number of features by just one.
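A sketch of this step, using the same assumed ansur_df and mask names:

```python
# All rows (:) as the first argument, the boolean mask as the second
reduced_df = ansur_df.loc[:, mask]

print(ansur_df.shape[1], "->", reduced_df.shape[1])  # 94 -> 93 columns
```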
4. Variance selector caveats
One problem with variance thresholds is that variance values aren't always easy to interpret or compare between features. We can illustrate this with a number of buttock measurements from the ANSUR dataset. In the boxplot, you can see that these measurements have different medians, the green horizontal lines in each blue rectangle, and different variances, the height of the blue rectangles and the length of the vertical lines extending from them. For this dataset, features with higher values tend to have higher variances, so we should normalize the variance before using it for feature selection.
5. Normalizing the variance
To do so, we divide each column by its mean value before fitting the selector. After normalization, the variances in the dataset will be lower overall, so we also lower the variance threshold, here to 0.005. Make sure to inspect your data visually when setting this value.
When we apply the selector to our dataset the number of features is more than halved, to 45.
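A sketch of the normalized workflow, again with the assumed ansur_df name:

```python
# Divide each column by its mean so variances become comparable
normalized_df = ansur_df / ansur_df.mean()

# After normalization, a much lower threshold is appropriate
sel = VarianceThreshold(threshold=0.005)
sel.fit(normalized_df)

# Apply the resulting mask to the original dataset
reduced_df = ansur_df.loc[:, sel.get_support()]
print(reduced_df.shape[1])  # 45 features remain
```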
6. Missing value selector
Another reason you might want to drop a feature is that it contains a lot of missing values.
7. Missing value selector
These missing values show up in our pandas DataFrame of the Pokemon dataset as NaN.
8. Identifying missing values
With the .isna() method, we can flag each of them with a boolean value.
9. Counting missing values
Boolean values can be summed, since True counts as one and False as zero, so if we chain the .sum() method to .isna() we get the total number of missing values in each column.
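For instance, assuming the data is in a DataFrame called pokemon_df (a name assumed here):

```python
# .isna() marks each missing value as True; .sum() counts the Trues per column
pokemon_df.isna().sum()
```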
10. Counting missing values
If we then divide this number by the total number of rows in the DataFrame we get a ratio of missing values between zero and one. In this example it turns out that almost half the Pokemon don't have a value for the Type 2 feature.
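Continuing the sketch:

```python
# Missing values per column divided by the number of rows gives a ratio
pokemon_df.isna().sum() / len(pokemon_df)
```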
11. Applying a missing value threshold
Based on this ratio, we can create a mask for features whose fraction of missing values is below a certain threshold. In this case, we set it to 0.3.
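As a sketch:

```python
# True for columns whose missing value ratio is below the 0.3 threshold
mask = pokemon_df.isna().sum() / len(pokemon_df) < 0.3
```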
12. Applying a missing value threshold
Once again, we can pass our mask to the .loc[] indexer to sub-select the columns. You can see that the Type 2 feature is gone now.
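In code, with the assumed names from above:

```python
# Keep all rows and only the columns that pass the threshold
reduced_df = pokemon_df.loc[:, mask]
```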
When features have some missing values, but not too many, we could apply imputation to fill in the blanks. But we won't go into that in this course.
13. Let's practice
Now it's your turn to remove features with little variance or many missing values.