1. Covariance and the Pearson correlation coefficient
We have more data than just the vote share for Obama. We also know the total number of votes in each county. Let's look at how these two quantities vary together.
2. 2008 US swing state election results
We start by looking at a scatter plot of the county data for the three swing states, plotting the percent vote for Obama versus the total number of votes in each county. Immediately from the scatter plot, we see that
3. 2008 US swing state election results
the twelve most populous counties all voted for Obama, and that most of the counties
4. 2008 US swing state election results
with small populations voted for McCain.
5. Generating a scatter plot
To generate a scatter plot, we plot the data as points by setting the marker and linestyle keyword arguments of plt.plot. (And of course we label the axes!) So, we have exposed another graphical EDA technique: scatter plots!
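As a sketch of the technique just described (the county numbers here are made up for illustration; the exercises use the real swing-state data):

```python
import matplotlib
matplotlib.use('Agg')  # render without a display; drop this line in an interactive session
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical county data: total votes and percent vote for Obama
total_votes = np.array([50_000, 120_000, 8_000, 300_000, 45_000])
dem_share = np.array([48.2, 55.1, 38.5, 62.3, 44.9])

# Plot the data as points: marker='.' draws dots, and linestyle='none'
# suppresses the lines that plt.plot would otherwise draw between them
_ = plt.plot(total_votes, dem_share, marker='.', linestyle='none')

# And of course we label the axes!
_ = plt.xlabel('total votes')
_ = plt.ylabel('percent of vote for Obama')
plt.show()
```

Setting `linestyle='none'` is what turns `plt.plot` from a line plot into a scatter plot; `plt.scatter` would work too, but the course uses `plt.plot` throughout.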
We would like to have a summary statistic to go along with the information we have just gleaned from the scatter plot. We want a number that summarizes how Obama's vote share varies with the total vote count.
6. Covariance
One such statistic is the covariance.
To understand where it comes from,
7. Calculation of the covariance
let's annotate the scatter plot with the means of the two quantities we are interested in. Now let's look at
8. Calculation of the covariance
this data point, from Lucas County, Ohio. This data point
9. Calculation of the covariance
differs from the mean vote share for Obama, and
10. Calculation of the covariance
the mean total votes.
We can compute these differences for each data point. The covariance is the mean of the product of these differences. If x and y both tend to be above, or both below, their respective means together, as they are in this data set, then the covariance is positive. This means that they are positively correlated: when x is high, so is y; when the county is populous, it has more votes for Obama. Conversely, if x tends to be high while y is low, the covariance is negative, and the data are negatively correlated, or anticorrelated, which is not the case for this data set.
We can compute the covariance using built-in NumPy functions you will use in the exercises. However, if we want to have a more generally applicable measure of how two variables depend on each other, we want it to be dimensionless, that is to not have any units.
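The definition above translates directly into NumPy. This sketch uses made-up county numbers and checks the hand-rolled version against the built-in `np.cov` you will use in the exercises:

```python
import numpy as np

# Hypothetical county data: total votes (x) and percent for Obama (y)
x = np.array([50_000., 120_000., 8_000., 300_000., 45_000.])
y = np.array([48.2, 55.1, 38.5, 62.3, 44.9])

# Covariance by its definition: the mean of the products of the
# differences from the respective means
cov_manual = np.mean((x - np.mean(x)) * (y - np.mean(y)))

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y).
# By default np.cov divides by (n - 1); passing ddof=0 divides by n,
# matching the mean-of-products definition above.
cov_np = np.cov(x, y, ddof=0)[0, 1]

print(cov_manual, cov_np)  # the two computations agree
```

The positive value reflects what the scatter plot showed: populous counties tend to have a higher vote share for Obama. Note that the result carries units (here, votes times percent), which is why the next step makes it dimensionless.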
11. Pearson correlation coefficient
So, we can divide the covariance by the standard deviations of the x and y variables.
This is called the Pearson correlation coefficient, usually denoted by the Greek letter rho. It is a comparison of the variability in the data due to codependence (the covariance) to the variability inherent to each variable independently (their standard deviations).
Conveniently, it is dimensionless and ranges from -1 (for complete anticorrelation) to 1 (for complete correlation).
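The division described above is a one-liner, and NumPy's `np.corrcoef` computes the same quantity directly (again with made-up data standing in for the county numbers):

```python
import numpy as np

# Hypothetical county data: total votes (x) and percent for Obama (y)
x = np.array([50_000., 120_000., 8_000., 300_000., 45_000.])
y = np.array([48.2, 55.1, 38.5, 62.3, 44.9])

# Pearson rho: the covariance divided by the product of the
# standard deviations of x and y
cov = np.mean((x - np.mean(x)) * (y - np.mean(y)))
rho_manual = cov / (np.std(x) * np.std(y))

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is rho.
# The ddof choice cancels in the ratio, so no keyword is needed here.
rho_np = np.corrcoef(x, y)[0, 1]

print(rho_manual, rho_np)  # the two computations agree
```

Because the standard deviations carry the same units as the covariance, they cancel, leaving a dimensionless number guaranteed to lie between -1 and 1.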
12. Pearson correlation coefficient examples
A value of zero means that there is no correlation at all between the two variables, as shown in the plot on the upper left. Data with intermediate values are shown in the other plots. As you can see, the Pearson correlation coefficient is a good metric for the correlation between two variables.
13. Let's practice!
Now that you know what the Pearson correlation coefficient is and what it means, you can compute it in the exercises using Python. You will then have an added tool in your EDA summary statistics toolbox. Let's do it!