Correlation

1. Correlation

Welcome to the final chapter of the course, where we'll talk about correlation and experimental design.

2. Relationships between two variables

Before we dive in, let's talk about relationships between numeric variables. We can visualize these kinds of relationships with scatterplots. In this scatterplot, we can see the relationship between the total amount of sleep mammals get and the amount of REM sleep they get. The variable on the x-axis is called the explanatory or independent variable, and the variable on the y-axis is called the response or dependent variable.

3. Correlation coefficient

We can also examine relationships between two numeric variables using a number called the correlation coefficient. This is a number between -1 and 1, where the magnitude corresponds to the strength of the relationship between the variables, and the sign, positive or negative, corresponds to the direction of the relationship.

4. Magnitude = strength of relationship

Here's a scatterplot of two variables, x and y, that have a correlation coefficient of 0.99. Since the data points are closely clustered around a line, we can describe this as a near-perfect or very strong relationship. If we know what x is, we'll have a pretty good idea of what the value of y could be.

5. Magnitude = strength of relationship

Here, x and y have a correlation coefficient of 0.75, and the data points are more spread out.

6. Magnitude = strength of relationship

In this plot, x and y have a correlation of 0.56 and are therefore moderately correlated.

7. Magnitude = strength of relationship

A correlation coefficient around 0.2 would be considered a weak relationship.

8. Magnitude = strength of relationship

When the correlation coefficient is close to 0, x and y have no linear relationship, and the scatterplot looks completely random. In a plot like this one, knowing the value of x doesn't tell us anything about the value of y.

9. Sign = direction

The sign of the correlation coefficient corresponds to the direction of the relationship. A positive correlation coefficient indicates that as x increases, y also increases. A negative correlation coefficient indicates that as x increases, y decreases.
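To make the sign concrete, here's a small sketch with made-up data (the vectors `x`, `y_up`, and `y_down` are illustrative and not from the course). One relationship rises with x and the other falls, and both are perfectly linear, so the magnitude is 1 in each case and only the sign differs.

```r
# Illustrative data only: two perfectly linear relationships
x <- 1:10
y_up <- 2 * x + 1      # y increases as x increases
y_down <- 20 - 3 * x   # y decreases as x increases

r_up <- cor(x, y_up)      # positive: 1, up to floating-point rounding
r_down <- cor(x, y_down)  # negative: -1, up to floating-point rounding

print(c(r_up, r_down))
```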

10. Visualizing relationships

To visualize relationships between two variables, we can use a scatterplot created using geom_point.
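As a minimal sketch, here's what that could look like. We're assuming the mammal sleep data is the `msleep` dataset that ships with ggplot2, with `sleep_total` on the x-axis and `sleep_rem` on the y-axis. The transcript doesn't name the dataset, so treat those names as an assumption.

```r
library(ggplot2)

# Assumption: msleep (bundled with ggplot2) stands in for the mammal sleep data
p <- ggplot(msleep, aes(x = sleep_total, y = sleep_rem)) +
  geom_point()  # one point per mammal

# Displaying the plot drops rows where sleep_rem is missing, with a warning
p
```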

11. Adding a trendline

We can add a linear trendline to the scatterplot using geom_smooth. We'll set the method argument to "lm" to indicate that we want a linear trendline, and se to FALSE so that the confidence band around the line isn't drawn. Trendlines like this can make it easier to see a relationship between two variables.
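Here's a sketch of the trendline layer, again assuming the `msleep` dataset from ggplot2 as a stand-in for the mammal sleep data (the dataset and column names are our assumption, not stated in the transcript).

```r
library(ggplot2)

# Assumption: msleep (bundled with ggplot2) stands in for the mammal sleep data
p <- ggplot(msleep, aes(x = sleep_total, y = sleep_rem)) +
  geom_point() +
  # method = "lm" fits a straight line; se = FALSE hides the band around it
  geom_smooth(method = "lm", se = FALSE)

p
```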

12. Computing correlation

To calculate the correlation coefficient between two variables in R, we can use the cor function. The cor function takes in two numeric vectors and will return their correlation coefficient. Note that it doesn't matter which order the vectors are passed into the function since the correlation between x and y is the same thing as the correlation between y and x.
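A quick sketch with two small made-up vectors (illustrative values, not course data) shows both the call and the fact that the argument order doesn't matter.

```r
# Illustrative data only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

r_xy <- cor(x, y)  # about 0.775
r_yx <- cor(y, x)  # same value with the arguments swapped

print(r_xy)
print(isTRUE(all.equal(r_xy, r_yx)))  # TRUE
```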

13. Correlation with missing values

If you have any missing values in either variable, R will return NA when you calculate correlation. To ignore data points where one or both values are missing, set the use argument of cor to "pairwise.complete.obs".
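The sketch below uses made-up vectors with a missing value in each (illustrative data only). It compares the default behavior with use = "pairwise.complete.obs", which keeps only the pairs where both values are present.

```r
# Illustrative data only: each vector has one missing value
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 5, NA, 5)

r_default <- cor(x, y)  # NA, because of the missing values
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

# Equivalent by hand: keep only the pairs where both values are present
keep <- !is.na(x) & !is.na(y)
r_manual <- cor(x[keep], y[keep])

print(r_default)
print(isTRUE(all.equal(r_pairwise, r_manual)))  # TRUE
```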

14. Many ways to calculate correlation

There's more than one way to calculate correlation. The method we've been using in this video is called the Pearson product-moment correlation, which is also written as r. This is the most commonly used measure of correlation. Mathematically, it's calculated using this formula, where x-bar and y-bar are the means of x and y. The formula itself isn't important to memorize, but know that there are other measures that quantify correlation a bit differently, such as Kendall's tau and Spearman's rho. Those are beyond the scope of this course.
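The slide's formula isn't reproduced in this transcript. The standard definition of Pearson's r is given below; we're assuming this is what the slide shows, possibly written with the standard deviations of x and y in the denominator instead.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Here n is the number of data points, and x-bar and y-bar are the means of x and y. The numerator is unchanged if x and y are swapped, and so is the denominator, which is why the order of the variables doesn't matter.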

15. Let's practice!

Okay, time to practice calculating correlations.