Exercise

# R-squared goes up

Recall that R-squared is the variance in the model output divided by the variance in the actual response values. It is almost always calculated on the *training* data.
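This definition can be checked directly in R. Here is a minimal sketch using the built-in `mtcars` dataset (the variable choice is just for illustration; it is not part of the exercise):

```r
# Fit a simple linear model on a built-in dataset.
model <- lm(mpg ~ wt, data = mtcars)

# R-squared as defined above: variance of the model output (fitted
# values) divided by the variance of the actual response values.
r_squared <- var(fitted(model)) / var(mtcars$mpg)

# For a least-squares model with an intercept, this matches the
# R-squared that summary() reports.
all.equal(r_squared, summary(model)$r.squared)
```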

In cross validation, we use a training dataset to train the model and a separate testing dataset to evaluate its performance. This is because model performance tends to look better on the training data than on new cases, and we're usually interested in anticipating performance on new data rather than on the training data. Using cross validation allows us to compare the performance of different models in a fair way.
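The train/test idea can be sketched in a few lines of base R. This is a simplified single split, not the full cross-validation machinery the course uses, and the dataset and variables (`mtcars`, `mpg ~ wt`) are illustrative assumptions:

```r
set.seed(1)

# Randomly split the built-in mtcars data into training and testing halves.
idx <- sample(nrow(mtcars), size = nrow(mtcars) / 2)
training <- mtcars[idx, ]
testing  <- mtcars[-idx, ]

# Train on the training half only.
model <- lm(mpg ~ wt, data = training)

# Mean square error on each half.
mse_train <- mean((training$mpg - predict(model, training))^2)
mse_test  <- mean((testing$mpg - predict(model, testing))^2)

# The training error is typically smaller than the testing error,
# which is why evaluating on the training data flatters the model.
```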

Similarly, R-squared can give a misleading comparison between different models. In this exercise, you'll see that R-squared goes up as new explanatory variables are added, *even if those explanatory variables are meaningless*. In this supplementary video, I give an example of how to interpret R-squared values. Go ahead and watch it if you'd like!

Instructions

**100 XP**

- Train `model_1` with the formula `wage ~ sector` on the `Training` data using an `lm()` architecture.
- Train `model_2` the same as `model_1`, but add the variable `bogus` to the formula. `bogus` contains a set of completely meaningless random variables.
- Calculate R-squared for both models. Note that R-squared goes up substantially from `model_1` to `model_2`, even though `bogus` has no predictive value.
- To get a fair comparison of the predictive performance of the models, compare the cross-validated mean square errors for `model_1` and `model_2` with a `boxplot()`. Note that the prediction error for `model_2` is *worse* than for `model_1`. Adding random junk as explanatory variables actually *worsens* predictions, but R-squared doesn't show this.