1. Predicting teaching score using age
Let's take our basic linear regression model of score as a function of age and now use it for predictive ends. For example, say we have demographic information about a professor at UT Austin: can you make a good guess of their score? Or more generally, based on a single predictor variable x, can you make good predictions y-hat?
2. Refresher: Regression line
Recall our "best-fitting" regression line from the last video. Now say all you know about an instructor is that they are aged 40. What is a good guess of their score? Can you use the above visualization?
3. New instructor prediction
A good guess is the fitted value on the regression line for age = 40, marked by the square. It seems that this is roughly 4.25. To compute this precisely, however, you need to use the fitted intercept and fitted slope for age from the regression table.
4. Refresher: Regression table
Previously, you learned how to fit a linear regression model with one numerical explanatory variable and applied the get_regression_table() function from the moderndive package to obtain the fitted intercept and fitted slope values.
Recall that these values appear in the estimate column: 4.46 and -0.006, respectively.
5. Predicted value
Generally, you can use a fitted regression model f-hat for predictive as well as explanatory purposes.
Specific to our model, I've computed the predicted score via the equation 4.46 minus 0.006 times age.
So our new instructor, using this model, is predicted to get a score of 4.22, very close to my earlier visual prediction of 4.25.
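As a minimal sketch of this calculation in R, using the rounded coefficients from the regression table:

```r
# Fitted coefficients from the regression table (rounded)
intercept <- 4.46
slope_age <- -0.006

# Predicted score for a new instructor aged 40:
# y-hat = intercept + slope * age
age_new <- 40
score_hat <- intercept + slope_age * age_new
score_hat
# 4.22
```

Because the coefficients are rounded to two and three decimal places, this prediction is itself an approximation; using the unrounded estimates from the fitted model would give a slightly different value.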
6. Prediction error
Now say we find out the instructor actually got a score of 3.5, marked with a circle. Our prediction of 4.22 over-predicted! Let's mark the magnitude of the error with an arrow.
7. Prediction error
The length of this arrow is about 0.72 units. While the direction of the arrow is somewhat arbitrary, I set it to point downwards to indicate a negative error. What I've just illustrated is the modeling concept of a residual.
8. Residuals as model errors
A residual is the observed value y minus the fitted/predicted value y-hat.
This discrepancy between the two corresponds to the epsilon from the general modeling framework.
Here the negative residual of -0.72 corresponds to our over-prediction.
With linear regression, sometimes you'll obtain positive residuals and other times negative. In linear regression, these residuals average out to zero.
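Continuing the sketch above, the residual for our instructor is just the observed score minus the predicted one:

```r
observed <- 3.5    # the instructor's actual score
predicted <- 4.22  # the score predicted by our model

# Residual: observed y minus fitted/predicted y-hat
residual <- observed - predicted
residual
# -0.72
```

The negative sign confirms the over-prediction: the model guessed higher than what we observed.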
Now say you want predicted y-hats and residuals for ALL 463 instructors. You could repeat the procedure we just followed 463 times, but this would be tedious, so let's automate this procedure using another wrapper function from the moderndive package.
9. Computing all predicted values
Recall our earlier fitted linear model saved in model_score_1.
Instead of using get_regression_table(), let's now use the get_regression_points() function to get information on all 463 points in our dataset.
The first column, ID, identifies the rows.
The second and third columns are the original outcome and explanatory variables: score and age.
The fourth column, score_hat, is the predicted value y-hat, computed using the equation of the regression line.
The fifth column, residual, is the residual: score minus score_hat.
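Putting this together, a minimal sketch of the workflow (assuming the evals data frame from the moderndive package, as used in earlier videos):

```r
library(moderndive)

# Fit the same simple linear regression as before:
# teaching score as a function of age
model_score_1 <- lm(score ~ age, data = evals)

# One row per observation, with columns
# ID, score, age, score_hat, and residual
points <- get_regression_points(model_score_1)
points
```

This produces the predicted values and residuals for all 463 rows in one call, rather than repeating the hand calculation for each instructor.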
Let's go back to our original scatterplot and illustrate a few more residuals.
10. "Best fitting" regression line
Here, I'll plot 6 arbitrarily chosen residuals. Our earlier statement that the regression line is "best-fitting" means that of all possible lines, the blue regression line "minimizes" the residuals.
What do I mean by "minimize"? Imagine you drew all 463 residuals, squared their lengths so that positive and negative residuals were treated equally,
then summed them.
The regression line is the line that minimizes this quantity.
You'll see later that this quantity is called the sum-of-squared-residuals and it measures the "lack-of-fit" of a model to a set of points.
The blue regression line is "best" in that it minimizes this lack-of-fit.
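This minimization property can be illustrated with a tiny base-R sketch on made-up data (the four points and the hand-picked alternative line below are hypothetical, chosen only for illustration):

```r
# Four made-up (age, score) points
x <- c(30, 40, 50, 60)
y <- c(4.3, 4.2, 4.1, 3.9)

# Least-squares fit and its sum of squared residuals
fit <- lm(y ~ x)
ssr_fit <- sum(residuals(fit)^2)

# A nearby alternative line, chosen by hand:
# score = 4.6 - 0.012 * age
ssr_alt <- sum((y - (4.6 - 0.012 * x))^2)

# The least-squares line always has the smaller (or equal)
# sum of squared residuals
ssr_fit < ssr_alt
# TRUE
```

Whatever alternative line you try, its sum of squared residuals can never beat the least-squares line's; that is exactly what "best-fitting" means here.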
But first, time for some exercises!
11. Let's practice!
You'll be making your own predictions of teaching score, this time using not age but beauty score as the predictor.