Exercise

A better model?

In the previous two exercises, you compared a null model of start_position to a model using age and sex as explanatory variables. You didn't use cross-validation, so the calculated error rates are biased to be low. In this exercise, you'll apply a simple form of cross-validation: splitting the data into training and testing sets.
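For reference, a single random split might be produced along the following lines. This is a minimal sketch, not the exercise's own code: the data frame name Runners, the 25% holdout fraction, and the seed value are assumptions made here for illustration.

    # Assumed setup: `Runners` holds the full data set (hypothetical name).
    set.seed(101)                                   # seed chosen here only for reproducibility
    n <- nrow(Runners)
    test_rows <- sample(n, size = round(0.25 * n))  # hold out roughly a quarter of the cases
    Testing_data  <- Runners[test_rows, ]           # cases reserved for evaluation
    Training_data <- Runners[-test_rows, ]          # cases used to fit the models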

Your job is to evaluate the models on the testing sets and calculate the error rate.

A hint about interpreting the results: it's often the case that explanatory variables that you think should contribute to prediction in fact do not. Being able to reliably discern when potential explanatory variables do not help is a key skill in modeling.

Instructions

Training_data and Testing_data have been pre-loaded in your workspace.

  • Examine the code used to train three models on the training cases: a null model with all_the_same as the only "explanatory" variable (it's always equal to 1), a model with age as the only explanatory variable, and a model with both age and sex as explanatory variables. (A sketch of this workflow appears after this list.)
  • Evaluate each of the three models on the testing cases.
  • Calculate the prediction error rate for each model.
  • Print the error rates to the console and think about the following:
    • Does adding age as an explanatory variable improve predictions over the null model?
    • Does adding sex improve predictions over just using age?
    • A proper calculation would repeat the random division into training and testing sets several times, to decide whether the models' prediction errors differ in a statistically meaningful way. This is what cv_pred_error() does.
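For orientation, here is a minimal sketch of the train/evaluate/error-rate workflow referenced above. It assumes start_position is a categorical response and that the models are classification trees fit with rpart(); the pre-written training code in your workspace may use a different model-fitting function or settings, so treat these details as illustrative.

    library(rpart)

    # Constant "explanatory" variable for the null model (always equal to 1).
    Training_data$all_the_same <- 1
    Testing_data$all_the_same  <- 1

    # Train the three models on the training cases.
    null_model    <- rpart(start_position ~ all_the_same, data = Training_data)
    age_model     <- rpart(start_position ~ age,           data = Training_data)
    age_sex_model <- rpart(start_position ~ age + sex,     data = Training_data)

    # Evaluate each model on the testing cases.
    null_pred    <- predict(null_model,    newdata = Testing_data, type = "class")
    age_pred     <- predict(age_model,     newdata = Testing_data, type = "class")
    age_sex_pred <- predict(age_sex_model, newdata = Testing_data, type = "class")

    # The error rate is the fraction of testing cases classified incorrectly.
    mean(Testing_data$start_position != null_pred)
    mean(Testing_data$start_position != age_pred)
    mean(Testing_data$start_position != age_sex_pred)

Because a single split gives a noisy estimate, comparing these three numbers once is not enough to say the models genuinely differ; repeating the random split several times and averaging the results, as cv_pred_error() does, gives a more trustworthy comparison.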