
Quantifying Linear Relationships

1. Quantifying Linear Relationships

In previous exercises, we've used data visualization to explore the relationship between two variables. In this lesson, we introduce methods from *descriptive* statistics, including "correlation", as a way of *quantifying* linear trends in the data.

2. Pre-Visualization

Before reviewing any statistics, let's pause and note that visualization is always a great first step. Here, plotting 3 data sets reveals 3 very different trends. The data on the left is said to be highly correlated: as x increases, y increases with it, and the linear trend is apparent. The data on the far right shows that as x changes, y does not change with it in any systematic way; for this data, x and y are said to be "not correlated". The data in the middle is ambiguous. The correlation value is a quantitative measure of how strong a linear relationship exists between two variables in your data.
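
As a minimal sketch of this first step, assuming NumPy and matplotlib are available (the arrays x, y_strong, y_weak, and y_none are made up here purely for illustration), a side-by-side scatter plot might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data illustrating three strengths of linear trend
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y_strong = 2 * x + rng.normal(0, 1, size=x.size)   # clear linear trend
y_weak = 2 * x + rng.normal(0, 10, size=x.size)    # trend obscured by noise
y_none = rng.normal(0, 5, size=x.size)             # no relationship to x

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
titles = ["Highly correlated", "Ambiguous", "Not correlated"]
for ax, y, title in zip(axes, [y_strong, y_weak, y_none], titles):
    ax.scatter(x, y, s=10)
    ax.set_title(title)
    ax.set_xlabel("x")
axes[0].set_ylabel("y")
plt.show()
```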

3. Review of Single Variable Statistics

To understand correlation, we need to step back and review some statistics. In previous courses, you saw how to compute measures of central tendency and spread of a single variable. The mean is a measure of the center. For a measure of spread, start by subtracting the mean from every data point: the results are called deviations. If we average these, they cancel out to exactly zero, so we square them first and then average. The result is called the variance. But now the units are the square of the data's units, so we take the square root. The result is the standard deviation.
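
To make this concrete, here is a minimal sketch in NumPy (the array x is made up for illustration) that builds the standard deviation step by step and checks it against NumPy's built-ins:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # illustrative data

mean = np.mean(x)            # measure of center
dx = x - mean                # deviations from the mean
print(np.mean(dx))           # 0.0 -- deviations cancel exactly

variance = np.mean(dx**2)    # average of squared deviations
std = np.sqrt(variance)      # back to the data's original units

# np.var and np.std use the same "divide by N" convention by default
print(variance, np.var(x))   # 5.0 5.0
print(std, np.std(x))        # ~2.236 ~2.236
```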

4. Covariance

While the variance measures how a single variable varies, covariance measures how two variables "vary together". To compute it, first compute the deviation arrays, dx and dy, from each of two arrays, x and y. Then, take the product of each pair of deviations, and lastly, average all those products. For each deviation product, if both x and y are varying in the same *direction*, the result is positive. If they vary in opposite directions, the product is negative. The average of those products will be larger if both variables change in the same direction more often than not. But, as with the variance, covariance can be difficult to interpret and compare.
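
Here is a sketch of that recipe, again assuming NumPy, with made-up arrays x and y:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

dx = x - np.mean(x)          # deviations of x
dy = y - np.mean(y)          # deviations of y

# Positive products: x and y deviate in the same direction.
# Negative products: they deviate in opposite directions.
covariance = np.mean(dx * dy)

# np.cov divides by N-1 by default; bias=True divides by N instead
print(covariance, np.cov(x, y, bias=True)[0, 1])  # 1.6 1.6
```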

5. Correlation

If we divide each deviation by that variable's standard deviation, the result is the covariance of the normalized deviations, or "correlation". But why "normalize"?
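
Continuing the sketch from before (same made-up arrays), correlation is just the covariance recipe applied to normalized deviations, and NumPy's np.corrcoef gives the same answer:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Normalize: divide each deviation by that variable's standard deviation
zx = (x - np.mean(x)) / np.std(x)
zy = (y - np.mean(y)) / np.std(y)

# Correlation is the covariance of the normalized deviations
correlation = np.mean(zx * zy)

print(correlation, np.corrcoef(x, y)[0, 1])  # both ~0.853
```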

6. Normalization: Before

The problem with comparing two variables is that differences in center and spread make the covariance harder to interpret, and harder to compare across data sets. The figure here shows two variables with different centers and spreads.

7. Normalization: After

Here is what the two variables look like after "normalization": both now have a mean of zero and a standard deviation of one. Imagine these are the deviations from before. Now neither variable is weighted more heavily in the product. A note on terms: formally, "normalization" may denote only the rescaling, without the re-centering, but in practice it is common to hear it used for both.
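
A minimal sketch of that normalization, assuming NumPy, with two made-up variables a and b that have very different centers and spreads:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical variables with very different centers and spreads
a = rng.normal(loc=100.0, scale=20.0, size=500)
b = rng.normal(loc=5.0, scale=0.5, size=500)

def normalize(v):
    """Re-center to mean 0 and rescale to standard deviation 1."""
    return (v - np.mean(v)) / np.std(v)

za, zb = normalize(a), normalize(b)
print(np.mean(za), np.std(za))  # ~0.0, 1.0
print(np.mean(zb), np.std(zb))  # ~0.0, 1.0
```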

8. Magnitude versus Direction

Correlation always ranges from -1 to +1. Here we see 6 different data sets, each with a different correlation value. Think of correlation in two parts. First, the magnitude, which ranges from 1 down to 0: it decreases from left to right, across both rows. Second, the sign: positive or negative. The top row all have positive correlation; the bottom row, all negative. Positively correlated means that as one variable goes up, the other goes up. Negatively correlated means that as one goes up, the other goes down.
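
To see both parts at once, here is a small sketch (made-up data, assuming NumPy) where the sign of the trend controls the sign of r, and the noise level controls its magnitude:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
noise = rng.normal(size=1000)

# Sign comes from the direction of the trend; magnitude from the noise level
for label, y in [("strong positive", x + 0.2 * noise),
                 ("weak positive", x + 2.0 * noise),
                 ("strong negative", -x + 0.2 * noise),
                 ("weak negative", -x + 2.0 * noise)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"{label:>15}: r = {r:+.2f}")  # ~+0.98, +0.45, -0.98, -0.45
```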

9. Let's practice!

Now, let's put it all together by working some examples in code, each building on the last, so that in the end, you can see how correlation works.