1. Pairwise correlation
In the last two lessons, we focused on techniques to remove features based on their individual properties, such as the variance they show or the proportion of missing values they have. The next step is to look at how features relate to one another to decide whether they are worth keeping.
2. Pairwise correlation
Remember the pairplot we made in the first chapter?
3. Pairwise correlation
It allowed us to visually identify strongly correlated features. However, if we want to quantify the correlation between features, this method would fall short.
4. Correlation coefficient
To solve this, we need a measure for the strength of the correlation; this is where the correlation coefficient r comes in. The value of r always lies between minus one and plus one: minus one describes a perfectly negative correlation, zero describes no correlation at all, and plus one stands for a perfect positive correlation.
5. Correlation coefficient
When the relation between two features shows more variance, as is usually the case in real-world data, the correlation coefficients will be a bit closer to zero.
6. Correlation matrix
We can calculate correlation coefficients on pandas DataFrames with the .corr() method.
If we call it on the dataset the pairplot was built on, we'd get a so-called correlation matrix. It shows the correlation coefficient for each pairwise combination of features in the dataset.
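As a minimal sketch of what this looks like in practice (the column names and values here are made up for illustration, not the actual dataset from the slides):

```python
import pandas as pd

# Hypothetical body-measurement data; weight_lbs is weight_kg converted to pounds
df = pd.DataFrame({
    "weight_kg":  [60.0, 72.5, 80.0, 55.0, 90.0],
    "weight_lbs": [132.3, 159.8, 176.4, 121.3, 198.4],
    "height_cm":  [165.0, 178.0, 182.0, 160.0, 185.0],
})

# .corr() returns the pairwise Pearson correlation coefficients as a DataFrame
corr = df.corr()
print(corr)
```

Because the two weight columns encode the same quantity in different units, their coefficient comes out at (essentially) one.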
7. Correlation matrix
In fact, it even shows every pairwise correlation coefficient twice, since the correlation of A to B equals that of B to A.
8. Correlation matrix
Perfectly correlated features, such as weight in kilograms and weight in pounds, marked in red here, get a correlation coefficient of one, meaning that if you know one feature, you can perfectly predict the other for this dataset.
9. Correlation matrix
By definition, the diagonal in our correlation matrix shows a series of ones, telling us that, not surprisingly, each feature is perfectly correlated to itself.
10. Visualizing the correlation matrix
We can visualize this simple correlation matrix using Seaborn's heatmap() function.
We've passed a custom color palette and some styling arguments to this function to get a nice-looking plot. We can improve it further by removing duplicate and unnecessary information, like the correlation coefficients of one on the diagonal.
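A sketch of that call, using toy data and illustrative styling arguments (the exact palette and options from the slide are assumptions here):

```python
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Hypothetical toy data standing in for the pairplot dataset
df = pd.DataFrame({
    "weight_kg":  [60.0, 72.5, 80.0, 55.0, 90.0],
    "weight_lbs": [132.3, 159.8, 176.4, 121.3, 198.4],
    "height_cm":  [165.0, 178.0, 182.0, 160.0, 185.0],
})
corr = df.corr()

# A diverging palette centered on 0 makes positive vs. negative correlations easy to read
cmap = sns.diverging_palette(220, 10, as_cmap=True)
ax = sns.heatmap(corr, cmap=cmap, center=0, linewidths=1, annot=True, fmt=".2f")
plt.show()
```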
11. Visualizing the correlation matrix
To do so, we'll create a boolean mask. We use NumPy's ones_like() function to create a matrix of True values with the same dimensions as our correlation matrix, and then pass it to NumPy's triu() function (short for "triangle upper") to set every value outside the upper triangle to False.
12. Visualizing the correlation matrix
When we pass this mask to the heatmap() function it will ignore the upper triangle, allowing us to focus on the interesting part of the plot.
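Putting the mask and the heatmap call together, again with hypothetical toy data (seaborn hides the cells where the mask is True, so only the lower triangle is drawn):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Hypothetical toy data standing in for the dataset from the slides
df = pd.DataFrame({
    "weight_kg":  [60.0, 72.5, 80.0, 55.0, 90.0],
    "weight_lbs": [132.3, 159.8, 176.4, 121.3, 198.4],
    "height_cm":  [165.0, 178.0, 182.0, 160.0, 185.0],
})
corr = df.corr()

# True everywhere, same shape as corr; triu() keeps True only on and above the diagonal
mask = np.triu(np.ones_like(corr, dtype=bool))

# Masked (True) cells are hidden, leaving just the lower triangle visible
ax = sns.heatmap(corr, mask=mask, center=0)
plt.show()
```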
13. Visualizing the correlation matrix
When we apply this method to a slightly larger subset of the ANSUR data we can instantly spot that chest height is not correlated to the hip breadth while sitting, and that the suprasternale height is very strongly correlated to the chest height.
14. Let's practice!
Now it's your turn to use the correlation matrix to gain insight into a dataset.