1. Correlation
We've talked about relationships between variables; now let's look at one way to measure them: correlation.
2. Relationships between two variables
Recall that we can use a scatter plot to visualize relationships. Here we plot the costs of water versus gym memberships in different cities.
It's hard to determine whether a clear relationship exists between these two variables.
3. Pearson correlation coefficient
This is where the Pearson correlation coefficient, often referred to as the correlation coefficient, comes in handy. It was developed by Karl Pearson and published back in 1896!
It quantifies the strength of a linear relationship between two variables, producing a value between minus one and one. The magnitude of this number, meaning how far it is from zero, corresponds to the strength of the relationship, and the sign, positive or negative, corresponds to the direction of the relationship.
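To make the definition concrete, here is a minimal sketch that computes the coefficient by hand: the co-variation of the two variables divided by the square root of the product of their individual variations. The function name `pearson_r` and the sample values are our own, made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # How the two variables vary together around their means.
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # How each variable varies on its own.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return co_variation / math.sqrt(ss_x * ss_y)

# Hypothetical values, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 2))  # about 0.77: positive and fairly strong
```

Dividing by the two variations is what keeps the result between minus one and one, whatever units the variables are measured in.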
4. Linear relationships
Note that the Pearson correlation coefficient is only appropriate for linear relationships, meaning a change in one variable is accompanied by a proportionate change in the other.
For example, let's say that a bottle of water costs one dollar and the monthly price of a gym membership is twenty dollars in London. If the relationship were perfectly linear and water cost twice as much in Paris, we would expect a gym membership there to cost forty dollars.
5. Values = strength of the relationship
Here's a scatterplot of two variables, x and y, that have a correlation coefficient of 0.99.
The data is closely clustered around a diagonal line, so we describe this as a near-perfect or very strong relationship. If we know the value of x, we'll have a good idea of what the value of y could be.
6. Values = strength of the relationship
Comparing this to a correlation coefficient of 0.75, the data points still trend up and to the right, but are more spread out.
7. Values = strength of the relationship
This plot shows a correlation of 0.56, which would be considered a moderate relationship.
8. Values = strength of the relationship
A correlation coefficient around 0.2 would be considered a weak relationship.
9. Values = strength of the relationship
When the correlation coefficient is close to zero, x and y have no linear relationship, and the scatterplot here looks completely random.
This means that knowing the value of x doesn't tell us anything about the value of y.
10. Sign = direction
The sign of the correlation coefficient corresponds to the direction of the relationship. A positive correlation coefficient indicates that as x increases, y also increases. A negative correlation coefficient indicates that as x increases, y decreases.
11. Gym costs vs. water costs
Given what we now know about correlation, what do we think the correlation coefficient is between water costs and gym costs?
Well, the points don't fall along a clear line, suggesting it isn't a very strong relationship, but both values tend to increase together. So, perhaps there is a weak-to-moderate positive correlation.
12. Adding a trendline
A trendline makes it easier to visualize the relationship. The Pearson correlation coefficient is 0.35, confirming a weak to moderate positive relationship between the cost of a gym membership and the cost of a bottle of water.
13. Life expectancy vs. cost of a bottle of water
Be careful when interpreting the relationship between variables using the correlation coefficient.
Here is a plot of life expectancy and the cost of a bottle of water. There is a correlation coefficient of 0.61, suggesting a moderate positive relationship.
14. Correlation does not equal causation
Does this mean increasing the cost of water will increase life expectancy?
Well, it is important to recognize that just because a relationship exists, it doesn't mean that changing water costs will cause a change in life expectancy.
A popular phrase among statisticians is that correlation does not equal causation.
15. Confounding variables
When looking at relationships among data, it is important to ask what else might be affecting the values.
The cost of a bottle of water is typically higher in locations with stronger economies, and these locations may also offer better access to high-quality healthcare. So perhaps life expectancy is not affected by the cost of a bottle of water at all; instead, it may be affected by the strength of the economy.
This is known as a confounding variable, which is something that affects the data we are analyzing, but was not accounted for when assessing the relationship between variables.
16. Let's practice!
Now let's see how strong the relationship is between watching this video and understanding correlation!