
Selecting based on variance

1. Selecting based on variance

Another way we can select features based on their information content is by using variance. Low- or no-variance features contain little to no information for modeling.

2. Variance of unscaled data

To compare variances fairly across features, we need to normalize the data. This plot shows the means with error bars marking two standard deviations. Notice annual_income's range is much larger than all of the other variables.

3. Variance of scaled data

When we normalize the data, the feature variances are more comparable.

4. Calculate scaled variances

We calculate the normalized variances with summarize() and across(). We use scale() to normalize the data, setting center to FALSE, then pass the normalized data to var() with na.rm set to TRUE. We store the feature variances in credit_variances. We pivot the data longer to tidy it so we can call arrange() and sort by variance in descending order. We'll use this descending order to explore natural cutoffs we can use as a threshold.
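Here is a minimal sketch of that calculation, assuming credit_df is a data frame holding the numeric credit features alongside the target (the exact pipeline is illustrative, not the course's verbatim code):

```r
library(dplyr)
library(tidyr)

credit_variances <- credit_df %>%
  # Normalize each numeric feature (no centering) and take its variance
  summarize(across(where(is.numeric),
                   ~ var(as.numeric(scale(.x, center = FALSE)), na.rm = TRUE))) %>%
  # Tidy the one-row result into feature / variance columns
  pivot_longer(everything(), names_to = "feature", values_to = "variance") %>%
  # Sort so natural cutoffs are easier to spot
  arrange(desc(variance))
```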

5. Variance cutoff

Here we see all the feature variances in credit_variances. As we scan down the variances,

6. Variance cutoff

we can see a natural break between amount_invested_monthly and monthly_inhand_salary. However, a variance threshold of, say, 0.5 would eliminate too many features. So, we keep moving down.

7. Variance cutoff

The next gap is between monthly_balance and credit_history_months. That's probably a good place for a threshold because

8. Variance cutoff

the next natural break eliminates too few features. So, a threshold of 0.1 would be good.

9. Variance cutoff plot

Before we move on, notice that a plot of descending feature variances makes the natural breaks a little easier to spot. However, establishing a variance threshold by natural breaks is somewhat subjective. We'll discuss a better method later.
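As a rough, hypothetical sketch of such a plot, assuming the credit_variances tibble from the earlier step and ggplot2:

```r
library(ggplot2)

credit_variances %>%
  ggplot(aes(x = reorder(feature, -variance), y = variance)) +
  geom_point() +
  geom_hline(yintercept = 0.1, linetype = "dashed") +  # candidate threshold
  labs(x = "Feature", y = "Scaled variance") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```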

10. Create variance filter

First, let's say we settle on a threshold of 0.1. Then, as we've seen before, we can create a filter mask using dplyr's filter() and pull() to reduce the dimensionality.
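A minimal sketch of that filter, assuming the credit_variances and credit_df objects from the previous steps (the result name is illustrative):

```r
# Features whose scaled variance falls below the 0.1 threshold
low_variance_features <- credit_variances %>%
  filter(variance < 0.1) %>%
  pull(feature)

# Drop those columns from credit_df to reduce dimensionality
credit_df_filtered <- credit_df %>%
  select(-all_of(low_variance_features))
```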

11. The tidymodels approach

But there is a better approach. The tidymodels framework provides two recipe steps to select features based on variance. To use them, we define a recipe object. The first input is a formula that allows tidymodels to distinguish between the target and predictor variables: the target is credit_score and all other variables are predictors. We then add step_zv() to the recipe, which removes features with zero variance. We do this before normalizing the data so we don't get an error caused by zero-variance features. Next, we use step_scale() to normalize all the numeric predictors. Then we add step_nzv(), where nzv stands for near-zero variance. It removes features that have very few unique values and a large count discrepancy between the first and second most frequent values. This is different from the naive variance threshold we filtered on previously. Finally, we call prep() to train the recipe.

To apply the recipe to credit_df, we use bake(), passing it the prepared recipe object and setting new_data to NULL to indicate we want to apply the recipe to the same data it was trained on, credit_df.
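Here is a sketch of that recipe, assuming credit_df contains the target credit_score plus the predictor columns (the object names are illustrative):

```r
library(recipes)

variance_recipe <- recipe(credit_score ~ ., data = credit_df) %>%
  step_zv(all_predictors()) %>%             # step 1: drop zero-variance features
  step_scale(all_numeric_predictors()) %>%  # step 2: normalize numeric predictors
  step_nzv(all_numeric_predictors()) %>%    # step 3: drop near-zero-variance features
  prep()                                    # train the recipe on credit_df

# Apply the trained recipe to the same data it was trained on
credit_baked <- bake(variance_recipe, new_data = NULL)
```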

12. Investigating effect of a specific step

One last note. Once a recipe has been prepared, the tidy() function allows us to explore the effect of any of its steps. Here we pass it the prepared recipe object and set the number argument to the step number we want to explore, in this case three, so we can see which features step_nzv() will remove.
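Using the illustrative variance_recipe object from the sketch above, that call would look something like:

```r
# Show which features the near-zero-variance step (step 3) will remove
tidy(variance_recipe, number = 3)
```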

13. Let's practice!

Time to practice!
