Exercise

Testing and training datasets

In this exercise, you'll see one way to split your data into non-overlapping training and testing groups. Of course, the split will be done at random so that the testing and training data are similar in a statistical sense.
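
For instance, here is a minimal sketch of such a random split in R, using a small hypothetical data frame df (not the exercise data): each row is assigned TRUE or FALSE at random, so the two groups partition the data without overlap.

    # Hypothetical data frame standing in for the real data
    df <- data.frame(id = 1:8)
    # Each row gets TRUE (training) or FALSE (testing) with probability 1/2
    df$training_cases <- runif(nrow(df)) < 0.5
    subset(df,  training_cases)   # the training group
    subset(df, !training_cases)   # the testing group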

The code in the editor is organized to produce two prediction error results: one for the training cases and one for the testing cases. Your goal is to see whether there is a systematic difference between prediction accuracy on the training cases and on the testing cases.

Since the split is being done at random, the results will vary somewhat each time you do the calculation. As you'll see in later exercises, you deal with this randomness by rerunning the calculation many times.
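
As a preview of that idea, here is a minimal sketch in R using the built-in mtcars data and a simple model (deliberately not the exercise's data or model): one split-fit-evaluate cycle is wrapped in a function, and replicate() averages the testing MSE over many random splits.

    # One complete trial: random split, fit on training, MSE on testing.
    # Uses the built-in mtcars data, not the exercise data.
    one_trial <- function() {
      in_training <- rnorm(nrow(mtcars)) > 0
      mod  <- lm(mpg ~ wt, data = subset(mtcars, in_training))
      test <- subset(mtcars, !in_training)
      mean((test$mpg - predict(mod, newdata = test))^2)
    }
    # Rerunning many times smooths out the split-to-split variation
    mean(replicate(100, one_trial()))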

Instructions
  • Examine the code that adds a column named training_cases to Runners_100 consisting of random TRUEs and FALSEs. The TRUEs will be the training cases and the FALSEs will be the testing cases.
  • Build the linear model net ~ age + sex on the training data. The code for subsetting Runners_100 is provided for you.
  • Use evaluate_model() to find the model predictions on the testing data. You can use a similar call to subset() on Runners_100, replacing training_cases with !training_cases (i.e. "not training cases").
  • Calculate the mean square error (MSE) on the testing data. This is the mean of (net - model_output)^2. A sketch of the complete workflow appears after this list.
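
Putting these steps together, here is a sketch of the complete workflow. It assumes the statisticalModeling package (which supplies Runners_100 and evaluate_model() in this course) is loaded, and that evaluate_model() returns the testing data with a model_output column appended, as described above; the exact code in the editor may differ.

    library(statisticalModeling)

    # Mark roughly half the cases, at random, as training cases
    Runners_100$training_cases <- rnorm(nrow(Runners_100)) > 0

    # Build the linear model on the training cases only
    mod <- lm(net ~ age + sex, data = subset(Runners_100, training_cases))

    # Find the model predictions on the testing cases
    preds <- evaluate_model(mod, data = subset(Runners_100, !training_cases))

    # Mean square error on the testing data
    with(preds, mean((net - model_output)^2))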