1. Regression to the mean
Let's take a short break from regression modeling to look at a related concept called "regression to the mean". Regression to the mean is a property of the data, not a type of model, but linear regression can be used to quantify its effect.
2. The concept
You already saw that each response value in your dataset is equal to the sum of a fitted value, that is, the prediction by the model, and a residual, which is how much the model missed by.
Loosely speaking, these two values are the part of the response that your model can explain, and the part that it can't.
There are two possibilities for why you have a residual. Firstly, it could just be because your model isn't great. Particularly in the case of simple linear regression where you only have one explanatory variable, there is often room for improvement. However, it usually isn't possible or desirable to have a perfect model because the world contains a lot of randomness, and your model shouldn't capture that.
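This fitted-plus-residual decomposition is easy to check in R. A minimal sketch, using the built-in mtcars dataset rather than the course data:

```r
# Fit a simple linear regression (mtcars ships with R; the
# variables here are just stand-ins for illustration).
mdl <- lm(mpg ~ wt, data = mtcars)

# Each response value equals its fitted value plus its residual.
all.equal(unname(fitted(mdl) + residuals(mdl)), mtcars$mpg)  # TRUE
```

This identity holds exactly (up to floating point) for any linear regression: the residuals are defined as whatever is left over after the fitted values.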
In particular, extreme responses are often due to randomness or luck. That means that extremes don't persist over time, because eventually the luck runs out.
This is the concept of regression to the mean. Eventually, extreme cases will look more like average cases.
3. Pearson's father son dataset
Here's a classic dataset on the heights of fathers and their sons, collected by Karl Pearson, the statistician after whom the Pearson correlation coefficient is named.
The dataset consists of over a thousand pairs of heights, and was collected as part of nineteenth century scientific work on biological inheritance. It lets us answer the questions, "do tall fathers have tall sons?" and "do short fathers have short sons?".
4. Scatter plot
Here's a scatter plot of the sons' heights versus the fathers' heights. I've added a line where the sons' and fathers' heights are equal, using geom_abline. The color and size arguments are used to help it stand out.
I also used coord_fixed, so that one centimeter on the x-axis appears the same as one centimeter on the y-axis.
If sons always had the same height as their fathers, all the points would lie on this green line.
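As a sketch, the plot might be built like this. The father_son data frame and its column names here are simulated stand-ins, since the real dataset isn't shown in this section:

```r
library(ggplot2)

# Simulated stand-in for Pearson's data (names and values are illustrative).
set.seed(1)
father_son <- data.frame(father_height_cm = rnorm(500, mean = 170, sd = 7))
father_son$son_height_cm <- 85 + 0.5 * father_son$father_height_cm +
  rnorm(500, mean = 0, sd = 6)

p <- ggplot(father_son, aes(father_height_cm, son_height_cm)) +
  geom_point() +
  # Line of equality: points on it are sons exactly as tall as their fathers.
  # Default slope is 1 and intercept is 0, which is exactly what we want.
  geom_abline(color = "green", size = 1) +
  # One cm on the x-axis covers the same distance as one cm on the y-axis.
  coord_fixed()
p
```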
5. Adding a regression line
Let's add a linear regression line to the plot. You can see that the regression line isn't as steep as the first line.
On the left of the plot, the blue line is above the green line, suggesting that for very short fathers, their sons are taller than them on average.
On the far right of the plot, the blue line is below the green line, suggesting that for very tall fathers, their sons are shorter than them on average.
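In ggplot2, that regression line is one extra geom_smooth layer. Again a sketch, with simulated stand-in data in place of the real dataset:

```r
library(ggplot2)

# Simulated stand-in for Pearson's data (names and values are illustrative).
set.seed(1)
father_son <- data.frame(father_height_cm = rnorm(500, mean = 170, sd = 7))
father_son$son_height_cm <- 85 + 0.5 * father_son$father_height_cm +
  rnorm(500, mean = 0, sd = 6)

p <- ggplot(father_son, aes(father_height_cm, son_height_cm)) +
  geom_point() +
  geom_abline(color = "green", size = 1) +   # line of equality
  geom_smooth(method = "lm", se = FALSE) +   # linear regression line (blue by default)
  coord_fixed()
p
```

Because the fitted slope is less than one, the blue regression line is shallower than the green line of equality, crossing it near the middle of the data.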
6. Running a regression
Running a regression quantifies these predictions of how much taller or shorter the sons will be.
Here, the sons' heights are the response variable, and the fathers' heights are the explanatory variable.
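The model itself is a one-line call to lm(). A sketch, assuming simulated stand-in data with illustrative column names:

```r
# Simulated stand-in for Pearson's data (names and values are illustrative).
set.seed(1)
father_son <- data.frame(father_height_cm = rnorm(500, mean = 170, sd = 7))
father_son$son_height_cm <- 85 + 0.5 * father_son$father_height_cm +
  rnorm(500, mean = 0, sd = 6)

# Response on the left of the tilde, explanatory variable on the right.
mdl_son_vs_father <- lm(son_height_cm ~ father_height_cm, data = father_son)
coef(mdl_son_vs_father)
```

The signature of regression to the mean is a slope coefficient between zero and one: sons of tall fathers are predicted to be taller than average, but by less than their fathers were.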
7. Making predictions
Now we can make predictions. Consider the case of a really tall father, at one hundred and ninety centimeters. At least, that was really tall in the late nineteenth century. The predicted height of the son is one hundred and eighty three centimeters. Tall, but not quite as tall as his dad.
Similarly, the prediction for a one hundred and fifty centimeter father is one hundred and sixty three centimeters. Short, but not quite as short as his dad.
In both cases, the extreme value became less extreme in the next generation.
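Those predictions come from predict() applied to a small data frame of explanatory values. A sketch with simulated stand-in data, so the exact numbers will differ from the ones quoted above:

```r
# Simulated stand-in for Pearson's data (names and values are illustrative).
set.seed(1)
father_son <- data.frame(father_height_cm = rnorm(500, mean = 170, sd = 7))
father_son$son_height_cm <- 85 + 0.5 * father_son$father_height_cm +
  rnorm(500, mean = 0, sd = 6)
mdl_son_vs_father <- lm(son_height_cm ~ father_height_cm, data = father_son)

# Predict son heights for a very tall and a very short father.
explanatory_data <- data.frame(father_height_cm = c(190, 150))
predict(mdl_son_vs_father, explanatory_data)
```

With a slope below one, the prediction for the 190 cm father falls below 190 cm, and the prediction for the 150 cm father falls above 150 cm: both extremes are pulled toward the mean.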
8. Let's practice!
Time to apply regression to the mean to sports and finance.