1. Predicting teaching score using gender
You'll finish your exploration of basic regression by modeling for prediction with one categorical predictor variable. The idea remains the same as predicting with one numerical variable: based on the information contained in the predictor variable, in this case gender, can you accurately guess teaching scores?
2. Group means as predictions
You previously computed gender-specific means using group_by() and summarize(). This time, however, let's also compute the standard deviation, which is a measure of the variation/spread.
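As a rough sketch of that computation (assuming the evals data frame from the moderndive package is loaded and the variables are named score and gender, as in this course), you could run:

# Compute the mean and standard deviation of score for each gender
library(moderndive)
library(dplyr)

evals %>%
  group_by(gender) %>%
  summarize(mean_score = mean(score),
            sd_score = sd(score))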
On average, the male instructors got a score of 4.23 and the female instructors got a score of 4.09. Furthermore, there is variation around the mean scores, as evidenced by the standard deviations. The women had slightly more variation, with a standard deviation of 0.564.
So say you had an instructor at UT Austin and you knew nothing about them other than that they were male. A reasonable prediction of their teaching score would be the group mean for the men, 4.23.
However, surely there are more factors associated with teaching scores than just gender. How good can this prediction be? There must be a fair amount of error involved.
What was our method for quantifying error? It was the residuals! Let's get the predicted values and residuals for all 463 instructors.
3. Computing all predicted values and residuals
We use the get_regression_points() function I introduced earlier. Recall that whereas get_regression_table() returns the regression table, get_regression_points() returns information on each of the 463 rows in the evals dataframe, with each row representing one instructor.
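Here is a minimal sketch of that call (the model object name model_score_3 is an assumption, inferred from the model_score_3_points dataframe created later in this video):

# Fit the regression of score on gender, then get fitted values and residuals
library(moderndive)

model_score_3 <- lm(score ~ gender, data = evals)
get_regression_points(model_score_3)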
In the second column is the observed outcome variable y: score.
In the fourth column, score_hat, observe that there are only two possible predicted values: either 4.09 or 4.23, corresponding to the group means for the women and the men respectively.
The final column contains the residuals, which are the observed values y minus the predicted values y-hat, or here, score minus score_hat.
In the first row, since the prediction of 4.09 was lower than the observed value of 4.7, there's a positive residual, whereas in the third row, since the prediction of 4.09 was greater than the observed 3.9, there's a negative residual. If a predicted value were exactly equal to the observed value, the residual would be 0.
4. Histogram of residuals
This time, let's take the output of the get_regression_points() function and save it in a new dataframe called model_score_3_points. This way you can use this dataframe in a ggplot() function call to plot a histogram of the residuals.
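A sketch of those two steps (assuming ggplot2 is loaded along with moderndive, and that the fitted model is named model_score_3 as above; the binwidth is an arbitrary choice):

library(ggplot2)

# Save the fitted values and residuals in a dataframe
model_score_3_points <- get_regression_points(model_score_3)

# Plot a histogram of the residual column
ggplot(model_score_3_points, aes(x = residual)) +
  geom_histogram(binwidth = 0.25) +
  labs(x = "residual")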
5. Histogram of residuals
This histogram appears to be roughly centered at 0, indicating that on average the residuals, or errors, are 0. Sometimes, though, I make errors as large as +1 or -1 on a 5-point scale! Those are fairly large! Also note that I tend to make larger negative errors than positive errors.
But these shortcomings aren't necessarily bad. Remember, this is a very simplistic model with only one predictor: gender. The analysis of the residuals above suggests that we probably need more predictors than just gender to make good predictions of teaching scores.
Wouldn't it be great if you could fit regression models where you could use more than one explanatory or predictor variable?
You'll see in the upcoming Chapter 3 on multiple regression that you can! But first, let's do some exercises!
6. Let's practice!
Let’s make predictions using the categorical predictor variable rank and study the resulting residuals!