Correlation

1. Correlation

You did well on these exercises! In this lesson, we will look at the correlation between columns and what conclusions we can draw from it.

2. df.corr()

Pandas can calculate the correlation between columns with the .corr() method of a DataFrame. The output is a matrix with the original DataFrame columns on both the row and the column axis. The values range from -1 to 1, where 1 is the maximum positive correlation and -1 is the maximum negative correlation. The diagonal is always 1, since each column is fully correlated with itself. In the output, humidity and temperature are negatively correlated. This means that as temperature rises, humidity tends to fall.
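As a minimal sketch of what this looks like in code, the example below builds a small synthetic DataFrame and calls .corr() on it. The data and the column names temperature and humidity are assumptions made for illustration, not the lesson's actual dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lesson's data (values and column names are assumptions).
rng = np.random.default_rng(0)
temperature = rng.normal(15, 8, 200)
# Humidity is constructed to fall as temperature rises, plus some noise.
humidity = 80 - 1.5 * temperature + rng.normal(0, 5, 200)

df = pd.DataFrame({"temperature": temperature, "humidity": humidity})

# .corr() computes pairwise Pearson correlation by default.
corr = df.corr()
print(corr.round(2))
```

The printed matrix has 1 on the diagonal and a negative value in the two off-diagonal cells, because the synthetic humidity was built to decrease with temperature.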

3. heatmap

We can also visualize the correlation in a heatmap, which is usually easier to read than a table. The seaborn module provides a heatmap function, which takes the output of corr() as its first argument. We also specify annot=True. This writes the correlation value into each cell of the heatmap.
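A heatmap call of this kind could be sketched as follows. The DataFrame is again synthetic, and the column names are assumptions. The Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (values and column names are assumptions).
rng = np.random.default_rng(0)
temperature = rng.normal(15, 8, 200)
df = pd.DataFrame({
    "temperature": temperature,
    "humidity": 80 - 1.5 * temperature + rng.normal(0, 5, 200),
    "sunshine": 0.3 * temperature + rng.normal(0, 3, 200),
})

# Pass the correlation matrix as the first argument;
# annot=True writes each correlation value into its cell.
ax = sns.heatmap(df.corr(), annot=True)

# In an interactive session, plt.show() would display the figure.
```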

4. heatmap

We can clearly see the two almost black fields on the top left, which show the high negative correlation between temperature and humidity.

5. heatmap

We can also see a positive correlation between temperature and sunshine in column 3, so higher temperatures tend to go along with more sunshine.

6. heatmap

Because traffic for both light and heavy vehicles increases and decreases almost simultaneously throughout the day, light and heavy vehicles have a correlation of almost 1.

7. Pairplot

Another useful tool when analyzing data is the pairplot. A pairplot shows histograms on the diagonal from the top left to the bottom right, and scatterplots in the other cells, each with one column of the dataset on the x axis and another column on the y axis. From the image, we get confirmation that the two vehicle columns for light and heavy vehicles are strongly correlated. We also see that humidity and temperature are negatively correlated, but not as strongly as the two vehicle count columns, which confirms the correlation we discovered earlier. On the other hand, temperature and humidity don't seem to be tightly correlated with the two vehicle columns, since the points in those scatterplots are widely spread.

8. Summary

In this lesson, we've seen correlations plotted as a heatmap. We've seen that temperature and humidity are negatively correlated, so if one rises, the other tends to fall, and vice versa. We've also seen the opposite between the sunshine duration and temperature columns: if one rises, the other is likely to rise too. An extreme example of correlation was between the vehicle columns, which had a correlation close to 1. Both columns move almost as one, so it might make sense to combine the two columns into one if we were to apply machine learning to this dataset, since they provide almost the same information.
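One way to act on this could look like the sketch below: list the column pairs whose absolute correlation exceeds a threshold, then replace such a pair with a single column. The synthetic data, the column names, the 0.95 threshold, and the choice to combine by summing are all assumptions for illustration. Dropping one of the two columns would be an equally simple alternative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (values and column names are assumptions).
rng = np.random.default_rng(0)
n = 300
light = rng.poisson(200, n).astype(float)
df = pd.DataFrame({
    "light_vehicles": light,
    "heavy_vehicles": 0.2 * light + rng.normal(0, 0.5, n),
    "temperature": rng.normal(15, 8, n),
})

corr = df.corr()
threshold = 0.95  # assumed cut-off for "almost the same information"

# Collect each pair of distinct columns once (upper triangle only).
cols = list(corr.columns)
pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)

# Replace the two vehicle columns with their sum (one possible way to combine them).
combined = (
    df.drop(columns=["light_vehicles", "heavy_vehicles"])
      .assign(vehicles=df["light_vehicles"] + df["heavy_vehicles"])
)
```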

9. Let's practice!

And now it's your turn to detect some correlations.