Correlation tests

1. Correlation tests

In this video, we'll discuss correlations in data. Correlated data exists everywhere. Learning how to interpret correlations is key for making sound data-driven decisions.

2. Correlation

We know from earlier courses that correlation refers to a statistical relationship between two variables. We often think of correlation as referring to a linear relationship, but this isn't always the case. Any time a change in one set of data helps describe the change in another data set, the two data sets are said to be correlated. Our task in this lesson will be to explore how inferences can be made based on correlated data.

3. Rent prices in Chicago

Correlation can be very useful. Let's look at rent prices in Chicago, plotted as a normalized rent index with the date on the x-axis. We see a clear trend, but how can we gain insight into the pattern of rent prices we're observing?

4. Rent prices in Chicago versus USA

Suppose we compare rent prices in Chicago to those in the USA as a whole, with the date on the x-axis. We can see both moving roughly in unison. So while we see an upward trend in Chicago rent prices, this trend is not unique to Chicago. Instead, we can focus our attention on how much of the price change in Chicago simply reflects the overall market in the United States, and how much is unique to Chicago. Doing so lets us make focused inferences about Chicago itself.

5. Pearson's R in SciPy

First, let's compute Pearson's R using the SciPy function pearsonr. It returns both R and a p-value. Here we see a very large value of R, indicating a very strong positive linear correlation between the two samples. The p-value tests the null hypothesis that the samples are uncorrelated. As expected, that hypothesis is rejected, indicating that rents in the USA as a whole and in Chicago are indeed correlated.
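As a rough illustration of the call described here, the sketch below runs pearsonr on two made-up series. The arrays usa_rent and chicago_rent are synthetic stand-ins invented for this example, not the rent data shown in the video, so the printed values will not match the ones on the slide.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two rent indices (NOT the real data).
rng = np.random.default_rng(0)
usa_rent = np.linspace(1.0, 1.5, 60) + rng.normal(0, 0.01, 60)
chicago_rent = 0.9 * usa_rent + rng.normal(0, 0.02, 60)

# pearsonr returns the correlation coefficient and a p-value for the
# null hypothesis that the two samples are uncorrelated.
r, p_value = stats.pearsonr(usa_rent, chicago_rent)
print(f"Pearson's R: {r:.3f}, p-value: {p_value:.3g}")
```

A value of R close to 1 together with a small p-value is the pattern described above: a strong positive linear relationship that is unlikely to arise from uncorrelated samples.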

6. Explained variance

Going back to inference on Chicago rent, to what extent are Chicago rent prices reflective of the overall trend in the US? Recall that R-squared, the square of Pearson's R, tells us the proportion of variation in one sample that is explained by knowing the other sample. In our example, that means that, by knowing the rent prices in the USA as a whole, we can explain 88.3 percent of the variation in rent prices in Chicago. So rent price changes in Chicago are largely explained by the US trend.
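To make the link between R and explained variance concrete, here is a small sketch that squares Pearson's R and cross-checks it against the share of variance explained by a least-squares line. The data is again synthetic and the variable names are hypothetical, so the resulting percentage is not the 88.3 percent from the lesson.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two rent indices (NOT the real data).
rng = np.random.default_rng(1)
usa_rent = np.linspace(1.0, 1.5, 60)
chicago_rent = 0.9 * usa_rent + rng.normal(0, 0.05, 60)

# R-squared is simply Pearson's R, squared.
r, _ = stats.pearsonr(usa_rent, chicago_rent)
r_squared = r ** 2

# Cross-check: fit a straight line and measure how much of the
# variance in chicago_rent the fitted values account for.
fit = stats.linregress(usa_rent, chicago_rent)
residuals = chicago_rent - (fit.intercept + fit.slope * usa_rent)
explained = 1 - residuals.var() / chicago_rent.var()

print(f"R-squared: {r_squared:.3f}")
print(f"Unexplained share: {1 - r_squared:.3f}")
```

For a simple linear fit with an intercept, the two quantities agree, and one minus R-squared is the share of variation left for other factors to explain.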

7. Inference from correlation

Our goal, then, would be to look into other factors that may be unique to Chicago. These could include differences in job prospects, weather, taxes, and other areas where Chicago may differ from the US as a whole. By focusing on what is unique to Chicago, we can try to explain the roughly twelve percent of the variation that is still unexplained.

8. Drawbacks of correlation

However, correlations can be hidden within our data. Failure to recognize and address them can cause hypothesis tests to give invalid conclusions. In particular, many hypothesis tests, such as a t-test, assume the samples are independent. However, sometimes data can be correlated with itself. For example, rent prices today depend strongly on rent prices yesterday, and don't simply fluctuate independently. Therefore, our data violates the assumption of independence required to conduct a t-test on this data.

9. Autocorrelation

Finally, data can be correlated with past values of itself. We call this autocorrelation. By viewing several steps, or lags, in the past, we can compare data to itself. Here we compare rent prices year-over-year in Chicago and see an interesting positive relationship.
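One way to sketch this idea is to correlate a series with a shifted copy of itself. The example below assumes monthly observations, so a lag of 12 corresponds to a year-over-year comparison; that assumption, the helper name autocorrelation, and the synthetic rent series are all illustrative and not taken from the video.

```python
import numpy as np
from scipy import stats

# Synthetic monthly rent index (NOT the real data): a random walk with
# drift, so each value depends strongly on the one before it.
rng = np.random.default_rng(2)
rent = 1.0 + np.cumsum(0.005 + rng.normal(0, 0.01, 120))

def autocorrelation(series, lag):
    """Pearson's R between a series and itself shifted back by `lag` steps."""
    r, _ = stats.pearsonr(series[lag:], series[:-lag])
    return r

lag_1 = autocorrelation(rent, 1)    # this month versus last month
lag_12 = autocorrelation(rent, 12)  # this month versus the same month last year
print(f"Lag-1 autocorrelation:  {lag_1:.3f}")
print(f"Lag-12 autocorrelation: {lag_12:.3f}")
```

A strongly positive value at a given lag suggests that past values carry information about current ones, which is exactly the situation in which tests that assume independent samples become unreliable.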

10. Let's practice!

Let's apply these techniques to further investigate this data.