1. Removing highly correlated features
Features that are perfectly correlated with each other, with a correlation coefficient of one or minus one, bring no new information to a dataset but do add to its complexity.
So naturally, we would want to drop one of the two features that hold the same information. In addition to this, we might want to drop features with correlation coefficients close to one or minus one if they are measurements of the same or similar things.
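As a minimal sketch of the idea, consider two columns that express the same measurement in different units (the column names and values below are made up for illustration):

```python
import pandas as pd

# Two columns recording the same quantity in different units are
# perfectly correlated, so one of them adds nothing but complexity.
df = pd.DataFrame({
    "height_cm": [160.0, 172.5, 181.0, 168.2],
    "height_in": [63.0, 67.9, 71.3, 66.2],
})

# The off-diagonal correlation coefficient is (virtually) 1.0.
print(df.corr())
```

Dropping either height_cm or height_in leaves the information content of the dataset unchanged.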
2. Highly correlated data
For example, in the ANSUR dataset there are measurements for suprasternale, cervicale and chest height. The suprasternale and cervicale are two bony landmarks in the upper chest and neck region, so these three height measurements always have very similar values.
3. Highly correlated features
We get correlation coefficients as high as 0.98. So for these features, too, it makes sense to keep only one. Not just for simplicity's sake, but also to keep models from overfitting on the small, probably meaningless, differences between these values.
If you are confident that dropping highly correlated features will not cause you to lose too much information, you can filter them out using a threshold value.
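As a hedged sketch, assuming the measurements live in a DataFrame called ansur_df and use the column names below (your copy of the data may name them differently), you can check these pairwise correlations directly:

```python
# Assumed column names for the three height measurements; adjust them to
# match the actual ANSUR DataFrame you are working with.
height_cols = ["suprasternaleheight", "cervicaleheight", "chestheight"]

# Pairwise correlation matrix for just these three features; the
# off-diagonal values are expected to be close to 0.98.
print(ansur_df[height_cols].corr())
```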
4. Removing highly correlated features
First create a correlation matrix and take its absolute values, so that strong negative correlations are filtered out too. Then create a mask for the upper triangle of the matrix, just like we did when we were visualizing the correlation matrix.
When we pass this mask to the pandas DataFrame .mask() method, it will replace all positions in the DataFrame where the mask has a True value with NA, so that our correlation matrix DataFrame looks like this.
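A sketch of these steps, again assuming a numeric DataFrame named ansur_df, could look like this:

```python
import numpy as np

# Absolute correlation matrix, so strong negative correlations count too.
corr_matrix = ansur_df.corr().abs()

# Boolean mask that is True on the upper triangle, including the diagonal.
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# .mask() sets every position where the mask is True to NA, so only the
# lower triangle of correlation values remains.
tri_df = corr_matrix.mask(mask)
print(tri_df)
```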
5. Removing highly correlated features
We can then use a list comprehension to find all columns that have a correlation with any other feature stronger than the threshold value.
The reason we used the mask to set half of the matrix to NA values is that we want to avoid removing both features when they have a strong correlation.
Finally we drop the selected features from the DataFrame with the .drop() method.
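Continuing the sketch from above, with 0.95 as an example threshold (the value itself is an assumption you should tune to your data):

```python
# Columns whose (lower-triangle) correlation with any other feature exceeds
# the threshold; only one feature of each correlated pair ends up listed.
threshold = 0.95
to_drop = [col for col in tri_df.columns if any(tri_df[col] > threshold)]

# Drop those columns from the original DataFrame.
reduced_df = ansur_df.drop(to_drop, axis=1)
print(f"Dropped {len(to_drop)} features: {to_drop}")
```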
6. Feature extraction as an alternative
The method we just discussed is a bit of a brute force approach that should only be applied if you have a good understanding of the dataset. If you're unsure whether removing highly correlated features will remove important information from the data but still need to reduce dimensionality, you could consider feature extraction techniques. These remove correlated features for you, and we'll be looking into them in the final chapter.
7. Correlation caveats - Anscombe's quartet
What's important to know about correlation coefficients is that they can produce weird results when the relation between two features is non-linear or when outliers are involved. For example, the four datasets displayed here, known as Anscombe's quartet, all have the same correlation coefficient of 0.82. To avoid unpleasant surprises like this, make sure you visually check your data early on.
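You can reproduce this with seaborn's built-in copy of the quartet (assuming seaborn is installed; the dataset is fetched on first use):

```python
import seaborn as sns

anscombe = sns.load_dataset("anscombe")

# Correlation between x and y within each of the four datasets: all four
# come out around 0.82, even though the scatter plots look nothing alike.
for name, group in anscombe.groupby("dataset"):
    print(name, round(group["x"].corr(group["y"]), 3))
```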
8. Correlation caveats - causation
A final thing to know about strong correlations is that they do not imply causation. In this example dataset, the number of firetrucks sent to a fire is correlated with the number of people wounded by that fire. Concluding that the higher number of wounded people is caused by sending more firetrucks would be wrong, and even dangerous if it were used as a reason to send fewer trucks in the future.
9. Let's practice!
Now it's your turn to remove highly correlated features from a dataset.