Exercise

R-squared goes up

Recall that R-squared is the variance of the model output divided by the variance of the actual response values. It is almost always calculated on the training data.
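This definition can be checked directly against lm()'s reported value. A minimal sketch, using the built-in mtcars data (the model here is illustrative, not the exercise's wage model):

```r
# R-squared as variance of the fitted values over variance of the response.
model <- lm(mpg ~ wt, data = mtcars)

r2_by_variance <- var(fitted(model)) / var(mtcars$mpg)
r2_reported   <- summary(model)$r.squared

all.equal(r2_by_variance, r2_reported)  # TRUE
```

The two agree because, for a least-squares fit with an intercept, the mean of the fitted values equals the mean of the response.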

In cross-validation, we use a training dataset to train the model and a separate testing dataset to evaluate the model's performance. This is because performance tends to look better on the training data than on new cases, and we are usually interested in anticipating performance on new data rather than on the training data. Cross-validation lets us compare the performance of different models in a fair way.
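A single train/test split illustrates the idea. This is a hedged sketch on the built-in mtcars data; the 70/30 split and the model are arbitrary choices, not part of the exercise:

```r
# Split the data, fit on the training part, evaluate on the held-out part.
set.seed(1)
n <- nrow(mtcars)
in_train <- sample(n, size = round(0.7 * n))

train <- mtcars[in_train, ]
test  <- mtcars[-in_train, ]

model <- lm(mpg ~ wt, data = train)

mse_train <- mean((train$mpg - predict(model, train))^2)
mse_test  <- mean((test$mpg  - predict(model, test))^2)
# mse_test is typically (though not always) larger than mse_train
```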

Similarly, R-squared is a poor basis for comparing different models. In this exercise, you'll see that R-squared goes up as new explanatory variables are added, even if those explanatory variables are meaningless. In this supplementary video, I give an example of how to interpret R-squared values. Go ahead and watch it if you'd like!

Instructions
100 XP
  • Train model_1 with the formula wage ~ sector on the Training data using an lm() architecture.
  • Train model_2 the same as model_1, but add the variable bogus to the formula. bogus contains a set of completely meaningless random variables.
  • Calculate R-squared for both models. Note that R-squared goes up substantially from model_1 to model_2, even though bogus has no predictive value.
  • To get a fair comparison of the predictive performance of the models, compare the cross-validated mean square errors for model_1 and model_2 with a boxplot(). Note that the prediction error for model_2 is worse than for model_1. Adding random junk as explanatory variables actually worsens predictions, but R-squared doesn't show this.
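The steps above can be sketched end to end. Since the course's Training data frame and bogus variables aren't available here, this uses simulated stand-ins (the data-generating code and the cv_mse helper are fabrications for illustration, not the course's objects):

```r
# Simulated stand-ins for the course's Training data and bogus variables.
set.seed(2)
n <- 200
Training <- data.frame(
  sector = factor(sample(c("manuf", "service", "prof"), n, replace = TRUE))
)
Training$wage  <- 8 + 3 * (Training$sector == "prof") + rnorm(n)
Training$bogus <- matrix(rnorm(n * 10), ncol = 10)  # pure noise, no predictive value

model_1 <- lm(wage ~ sector, data = Training)
model_2 <- lm(wage ~ sector + bogus, data = Training)

# R-squared rises from model_1 to model_2 even though bogus is noise.
summary(model_1)$r.squared
summary(model_2)$r.squared

# Hypothetical helper: k-fold cross-validated mean square error.
cv_mse <- function(formula, data, k = 10) {
  folds <- sample(rep(1:k, length.out = nrow(data)))
  sapply(1:k, function(i) {
    fit     <- lm(formula, data = data[folds != i, ])
    holdout <- data[folds == i, ]
    mean((holdout$wage - predict(fit, holdout))^2)
  })
}

mse_1 <- cv_mse(wage ~ sector, Training)
mse_2 <- cv_mse(wage ~ sector + bogus, Training)
boxplot(mse_1, mse_2, names = c("model_1", "model_2"))
```

Because model_2 nests model_1, its in-sample R-squared can only go up; the cross-validated errors, by contrast, tend to be worse for model_2, which is the point of the exercise.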