1. Two or more variables
In this lesson, we'll talk more about analyzing the relationship between variables - this time in the context of two or more numerical variables. We'll go over different types of relationships, correlation, and more.
2. Types of relationships
By using scatter plots to compare one variable against another, we can get a feel for the kind of relationship we're dealing with. Here we see two common distinctions: strong versus weak and positive versus negative. Note that not all relationships are linear: they can be quadratic or exponential as well, and often there is no apparent relationship between the variables at all.
3. What is correlation?
You should be familiar with correlation, but let's briefly review. Correlation describes the relatedness between variables, meaning how much information variables reveal about each other. If two variables are positively correlated, like we see on the far left, then if one increases, the other is likely to increase as well. In Python, the scatter, pairplot, and corr functions are helpful here.
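As a quick sketch of those tools, assuming a tiny made-up dataset (the column names here are placeholders, not from the lesson):

```python
import pandas as pd

# Hypothetical data: two positively correlated variables
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [1.2, 1.9, 3.2, 4.1, 4.8, 6.3],
})

# plt.scatter(df["x"], df["y"]) would draw the scatter plot, and
# sns.pairplot(df) would draw every pairwise scatter at once;
# corr computes the correlation matrix directly:
print(df.corr())
```

For this data, the off-diagonal entry of the matrix is close to 1, reflecting the strong positive relationship.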
4. Covariance
Before we get further into correlation, we first must look at an important statistical building block: covariance. As the formula shows, the covariance of two variables is calculated as the average of the products of their mean-centered values: for each observation, subtract each variable's mean from its value, multiply the two deviations together, and average across all observations.
However, covariance falls short when it comes to interpretability, since its magnitude depends on the units and scales of the variables, so it can't be compared across datasets. It's mainly important because we can use the covariance to calculate the Pearson correlation coefficient - a metric that's much more interpretable and important for the interview.
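For instance, with NumPy we can compute this "average product of mean-centered values" directly and check it against the built-in function (the arrays here are made up for illustration):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.2, 1.9, 3.2, 4.1, 4.8, 6.3])

# Covariance as described: average product of mean-centered values
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov divides by n - 1 by default; ddof=0 matches the formula above
cov_np = np.cov(x, y, ddof=0)[0, 1]
print(cov_manual, cov_np)
```

Note the `ddof` detail: NumPy's default is the sample (n - 1) covariance, while the formula in the slide divides by n.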
5. Pearson's correlation
To get the Pearson's correlation coefficient, denoted by a lowercase r, we take the covariance and divide it by the product of the sample standard deviations of the two variables.
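A minimal sketch of that division, again on made-up arrays (np.std divides by n by default, matching the covariance used here, so the scaling factors cancel consistently):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.2, 1.9, 3.2, 4.1, 4.8, 6.3])

# Pearson's r: covariance divided by the product of standard deviations
cov = np.mean((x - x.mean()) * (y - y.mean()))
r = cov / (x.std() * y.std())
print(r)
```

The same value comes out of `np.corrcoef(x, y)[0, 1]` or `scipy.stats.pearsonr`, which is the form you'd typically use in practice.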
6. Pearson's correlation
We can see how much more interpretable this value is than covariance. A positive value means there is a positive relationship, while a negative value means there's a negative relationship. A value of 1 or -1 means a perfect linear relationship, while a value of 0 means there is no linear correlation.
Notice that the values only fall between positive 1 and negative 1.
You may also encounter this concept through the R squared value, which is simply the Pearson correlation squared. R squared is often interpreted as the proportion of the variance in Y that is explained by X and is great to include in interview answers.
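Sketching that relationship with the same made-up arrays as before, R squared is just the square of the coefficient:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.2, 1.9, 3.2, 4.1, 4.8, 6.3])

r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2  # proportion of variance in y explained by a linear fit on x
print(round(r_squared, 3))
```

Because r lies between -1 and 1, R squared always falls between 0 and 1, which is what makes it easy to quote in an interview answer.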
7. Correlation vs. causation
This brings us to correlation and causation. We've covered a few technical concepts here, but let's switch gears. Our goal is often to find out if there is a relationship between two variables, that is, does information about one of the variables tell us more about its counterpart?
But how do we know that the variables are actually related and don't just appear that way due to chance? How can we be sure that one variable actually causes the other?
8. Correlation vs. causation
Here's a funny example using real data that shows a strong correlation between divorce rate and consumption of margarine. Does this mean that margarine causes divorce, or are they simply correlated? This is obviously a bit extreme, but without a well-executed experiment, it can often be tough to tell. Interviewers may probe at your intuition around this topic.
9. Summary
That wraps up this lesson on analyzing the relationship between two or more variables. We talked about the types of different relationships and reviewed correlation, covariance, and Pearson's correlation coefficient. Finally, we touched on correlation versus causation.
10. Let's prepare for the interview!
Let's go ahead and practice this in Python!