
1. Selecting based on correlation with other features

In this lesson, we'll explore one last feature selection method, this one based on correlation. Remember that correlation is a measure of mutual information, and that mutual information represents redundant information.

2. Review correlation plot creation

Let's review how to make a correlation plot to help us identify mutual information. We'll use a subset of the healthcare company attrition data. First, we pass healthcare_df to select() and where() to select the continuous variables. Then we pipe it to correlate() to create a correlation matrix, which we pipe to shave() to remove the redundant upper half of the matrix. We pipe that to rplot(), setting print_cor to TRUE to overlay the numeric correlations on the plot. Lastly, we rotate the x-axis labels for better readability.
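The pipeline described above might look like the following sketch. It assumes the corrr package is installed and that healthcare_df is a data frame containing numeric columns; the 45-degree label rotation is one possible choice for readability.

```r
library(dplyr)
library(corrr)    # correlate(), shave(), rplot()
library(ggplot2)  # theme() for rotating axis labels

healthcare_df %>%
  # Keep only the continuous (numeric) variables
  select(where(is.numeric)) %>%
  # Build the correlation matrix
  correlate() %>%
  # Drop the redundant upper triangle of the matrix
  shave() %>%
  # Plot, overlaying the numeric correlations on the tiles
  rplot(print_cor = TRUE) +
  # Rotate the x-axis labels for better readability
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```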

3. Correlation plot

That code produces this correlation plot. Once we have the correlations, we might wonder, "How strongly should two features be correlated before we remove one of them from the data?"

4. Correlation strength

Here are some general guidelines: correlations with an absolute value below 0.3 are usually considered small (or low), between 0.3 and 0.7 medium, and above 0.7 strong (or high). These rough classifications vary with context, though. For instance, high correlations are more common in the physical sciences than in the social sciences, simply because humans are more unpredictable and messy.

5. A correlation filter?

In the past few lessons, we established a threshold to filter out features. With correlations, however, it is trickier, because a correlation, by definition, involves two features. In this example, performance rating and percent salary hike are highly correlated, at 0.77. However, we don't want to remove both of them, only one.

6. A correlation filter?

Let's illustrate this with a Venn diagram that captures the mutual information, or correlation, between percent salary hike and performance rating. In other words, the overlapping area represents information that is contained in both features. So, it's redundant.

7. A correlation filter?

If we remove performance rating, we would keep one copy of the mutual information. That's a good thing. But we also lose the unique information that performance rating provides.

8. A correlation filter?

However, if we also removed percent salary hike, we'd lose the mutual information entirely, including the copy that percent salary hike shared with performance rating. So, we see that creating a threshold-based filter is not straightforward.

9. A correlation filter recipe

Therefore, we will default to tidymodels' step_corr() recipe step. Here's how to use it. First, we create the recipe object and define the formula by specifying the target variable, Attrition, and the predictors as all other variables, setting data to healthcare_df. Then we add step_corr() to the recipe and set threshold to 0.7 to identify highly correlated features. We call prep() to train the recipe on healthcare_df. To apply the correlation filter, we use the prepared recipe to bake the data. Notice how we pipe the recipe to bake() and set new_data to NULL to indicate that we want to bake the same data we used to prepare the recipe, healthcare_df. Lastly, if we'd like to see which features the recipe will remove, we use tidy(), passing it the prepared recipe and setting number to 1, since step_corr() is the first and only step in this recipe. And that's all!
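The steps above can be sketched as follows. This assumes healthcare_df contains an Attrition column plus numeric predictors; scoping step_corr() with all_numeric_predictors() is an assumption, since the transcript does not state which selector was used.

```r
library(recipes)  # part of tidymodels

# Define the formula: Attrition is the target, all other columns are predictors
corr_recipe <- recipe(Attrition ~ ., data = healthcare_df) %>%
  # Flag features whose pairwise correlation exceeds the 0.7 threshold
  step_corr(all_numeric_predictors(), threshold = 0.7) %>%
  # Train the recipe on healthcare_df
  prep()

# Apply the correlation filter; new_data = NULL bakes the data used in prep()
filtered_df <- corr_recipe %>%
  bake(new_data = NULL)

# Inspect which features step_corr() (step number 1) will remove
tidy(corr_recipe, number = 1)
```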

10. Let's practice!

Let's practice!