1. Correlation
So far we have summarized the variation in the data. Now let's explore the relationships between different groups of data. For example, what is the relationship between crime rates in different precincts? Is the number of crimes committed in one precinct significantly higher than in another? We'll discuss what correlations do and don't tell us about the relationships between groups.
2. Correlation
Correlation refers to the relationship between two groups or variables.
We typically quantify this relationship with a correlation coefficient, sometimes denoted as "r".
Correlation coefficients vary between -1 and 1.
Positive correlations indicate that two variables increase or decrease in tandem. Think of height and weight for example. Taller people typically weigh more.
Negative correlations, by contrast, indicate that two variables move in opposite directions. For example, studies suggest that more time playing video games is associated with lower grades for students.
A correlation coefficient near 0 indicates little or no relationship between variables.
3. The CORREL() function
We use the CORREL() function to calculate correlations.
CORREL() takes two arguments, which are the two ranges whose values we want to correlate. Both ranges should contain numeric values and have the same number of entries.
For example, if you want to correlate the values in columns G and H,
simply call the CORREL() function and specify those two columns.
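In a spreadsheet, that call would look something like =CORREL(G:G, H:H), assuming the data fill those two columns. To see what the function is doing, here is a minimal Python sketch of the Pearson correlation coefficient, the statistic that CORREL() returns. The function name pearson_r and the sample values are made up for illustration; they are not the data from the lesson.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys):
        raise ValueError("both ranges must have the same number of values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of the products of each pair's deviations from the means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / sqrt(ss_x * ss_y)

# Hypothetical values standing in for columns G and H.
col_g = [1, 2, 3, 4, 5]
col_h = [2, 4, 5, 4, 5]
r = pearson_r(col_g, col_h)  # about 0.77
```

Note that this sketch divides by zero if either column is constant, since a variable that never changes has no variation to correlate.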
4. Evaluating strength of relationships
In the previous example, the correlation coefficient between columns G and H was 0.69. What does that mean? Generally, we categorize correlations as weak, moderate, or strong.
Weak correlations have coefficients between -0.3 and positive 0.3. They suggest there is no meaningful connection between the two groups.
Moderate correlations have coefficients between -0.3 and -0.7 or positive 0.3 and positive 0.7. These coefficients suggest some connection between the two variables but other explanatory information may be needed.
Strong correlations have coefficients less than -0.7 or greater than 0.7. These coefficients suggest a meaningful relationship. As we'll see shortly, we must use judgment when interpreting correlation coefficients.
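These cutoffs can be sketched as a small Python helper. The function name correlation_strength is hypothetical, and treating a coefficient of exactly 0.3 or 0.7 as moderate is an assumption, since the categories above don't say which side the boundaries fall on.

```python
def correlation_strength(r):
    """Label a correlation coefficient as weak, moderate, or strong."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)  # strength depends on the magnitude, not the sign
    if size < 0.3:
        return "weak"
    if size <= 0.7:  # assumption: boundary values count as moderate
        return "moderate"
    return "strong"

label = correlation_strength(0.69)  # "moderate"
```

By this rule of thumb, the 0.69 from the columns G and H example sits at the upper end of the moderate range.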
5. Visualizing positive correlations
Let's see what these correlations look like in practice. Here there is a positive relationship in the data. As the x values increase, so too do the y values.
6. Visualizing negative correlations
Now examine these data. The downward sloping trendline suggests that as one variable increases, the other decreases.
7. Visualizing near-zero correlations
What about when the data exhibit no pattern? Here the flat trendline indicates that there is little or no linear relationship between the x and y values.
8. Problems with correlation
Correlation is a powerful tool for understanding relationships between values. However, there are a few problems with correlation we should keep in mind when interpreting results. Consider an example.
9. Problems with correlation
Ice cream consumption and drowning deaths are moderately correlated, with a coefficient of 0.5. The peaks you see in the plot are the summer months.
Does this mean eating more ice cream increases the likelihood of drowning?
Not necessarily.
For example, we don't know whether eating more ice cream causes people to drown
or whether people drown their sorrows by eating ice cream after loved ones drown.
Furthermore, it could be that a third variable is responsible for causing both things to change.
In our example, people both swim more and eat more ice cream in the summer months, when the temperature rises. Temperature is therefore a potential confounding factor that could explain the correlation between ice cream consumption and drowning deaths.
In short, correlation coefficients indicate a RELATIONSHIP between groups of data. They do NOT indicate that changes in one variable CAUSE changes in another.
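To make the third-variable idea concrete, here is a small Python sketch with invented numbers (not the data behind the plot). Both series are built from temperature alone, and neither is computed from the other, yet they still come out strongly correlated. All names and values are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

# Made-up monthly temperatures, January through December.
temperature = [2, 4, 8, 13, 18, 22, 25, 24, 20, 14, 8, 3]
# Fixed, unrelated wiggles so the two series are not perfectly tied to temperature.
wiggle_a = [3, -2, 1, -3, 2, -1, 3, -2, 1, -3, 2, -1]
wiggle_b = [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5]

# Each series depends only on temperature plus its own wiggle.
ice_cream = [10 + 3 * t + a for t, a in zip(temperature, wiggle_a)]
drownings = [1 + 0.2 * t + b for t, b in zip(temperature, wiggle_b)]

r = pearson_r(ice_cream, drownings)  # strongly positive, with no causal link
```

The coefficient is high only because temperature drives both series, which is exactly the situation a correlation coefficient cannot distinguish from a causal one.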
10. Let's practice!
With that, you're ready to go! Let's get correlating.