Exercise

Evaluate a modeling procedure using n-fold cross-validation

In this exercise, you will use splitPlan, the 3-fold cross-validation plan from the previous exercise, to make predictions from a model that predicts mpg$cty from mpg$hwy.
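
splitPlan comes from the previous exercise. As a reminder, here is a minimal sketch of one common way to build such a 3-fold plan; the use of vtreat::kWayCrossValidation() here is an assumption, and the previous exercise may have built the plan differently:

# Sketch only: assumes the plan was built with the vtreat package
library(vtreat)
data(mpg, package = "ggplot2")

# Create a 3-fold cross-validation plan:
# a list of 3 splits, each holding $train and $app row indices
splitPlan <- kWayCrossValidation(nrow(mpg), 3, NULL, NULL)

# Inspect the first split
str(splitPlan[[1]])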

If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:

# Initialize a column of the appropriate length
dframe$pred.cv <- 0 

# k is the number of folds
# splitPlan is the cross-validation plan
# fmla is the model formula (e.g. cty ~ hwy)

for(i in 1:k) {
  # Get the ith split
  split <- splitPlan[[i]]

  # Build a model on the training data 
  # from this split 
  # (lm, in this case)
  model <- lm(fmla, data = dframe[split$train,])

  # Make predictions on the
  # application data from this split
  dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app,])
}

Cross-validation predicts how well a model built from all the data will perform on new data. As with the test/train split, for a good modeling procedure, cross-validation performance and training performance should be close.

The data frame mpg, the cross-validation plan splitPlan, and the rmse() function have been pre-loaded.
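
rmse() is provided by the exercise environment, so its exact definition is not shown here; the following is only a sketch of what such a helper typically looks like (an assumption, not the course's actual code):

# Assumed implementation of the rmse() helper:
# predicted values first, actual outcomes second
rmse <- function(predcol, ycol) {
  res <- predcol - ycol    # residuals: predicted minus actual
  sqrt(mean(res^2))        # root mean squared error
}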

Instructions

  • Run the 3-fold cross-validation plan from splitPlan and put the predictions in the column mpg$pred.cv.
    • Use lm() and the formula cty ~ hwy.
  • Create a linear regression model on all the mpg data (formula cty ~ hwy) and assign the predictions to mpg$pred.
  • Use rmse() to get the root mean squared error of the predictions from the full model (mpg$pred). Recall that rmse() takes two arguments: the predicted values and the actual outcomes.
  • Get the root mean squared error of the cross-validation predictions. Are the two values about the same? (A sketch of one possible solution follows this list.)
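
For reference, here is a minimal sketch of one possible solution, assuming mpg, splitPlan, and rmse() are pre-loaded as described above:

# Sketch only; names not given in the instructions are illustrative
k <- 3                 # number of folds in splitPlan

# Cross-validation predictions
mpg$pred.cv <- 0
for (i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty ~ hwy, data = mpg[split$train, ])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app, ])
}

# Model built on all the data, and its predictions on that same data
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))

# Compare the RMSE of the full-model predictions to the RMSE of the
# cross-validation predictions (rmse() takes predictions, then actual outcomes)
rmse(mpg$pred, mpg$cty)
rmse(mpg$pred.cv, mpg$cty)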