Get startedGet started for free

Evaluate a modeling procedure using n-fold cross-validation

In this exercise, you will use splitPlan, the 3-fold cross validation plan from the previous exercise, to make predictions from a model that predicts mpg$cty from mpg$hwy.

If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:

# Initialize a column of the appropriate length
dframe$pred.cv <- 0 

# k is the number of folds
# splitPlan is the cross validation plan

for(i in 1:k) {
  # Get the ith split
  split <- splitPlan[[i]]

  # Build a model on the training data 
  # from this split 
  # (lm, in this case)
  model <- lm(fmla, data = dframe[split$train,])

  # make predictions on the 
  # application data from this split
  dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app,])
}

Cross-validation predicts how well a model built from all the data will perform on new data. As with the test/train split, for a good modeling procedure, cross-validation performance and training performance should be close.

The data frame mpg, the cross validation plan splitPlan, and the rmse() function have been pre-loaded.

This exercise is part of the course

Supervised Learning in R: Regression

View Course

Exercise instructions

  • Run the 3-fold cross validation plan from splitPlan and put the predictions in the column mpg$pred.cv.
    • Use lm() and the formula cty ~ hwy.
  • Create a linear regression model on all the mpg data (formula cty ~ hwy) and assign the predictions to mpg$pred.
  • Use rmse() to get the root mean squared error of the predictions from the full model (mpg$pred). Recall that rmse() takes two arguments, the predicted values, and the actual outcome.
  • Get the root mean squared error of the cross-validation predictions. Are the two values about the same?

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# mpg is available
summary(mpg)

# splitPlan is available
str(splitPlan)

# Run the 3-fold cross validation plan from splitPlan
k <- ___ # Number of folds
mpg$pred.cv <- 0 
for(i in ___) {
  split <- ___
  model <- lm(___, data = ___)
  mpg$pred.cv[___] <- predict(___, newdata = ___)
}

# Predict from a full model
mpg$pred <- ___(___(cty ~ hwy, data = mpg))

# Get the rmse of the full model's predictions
___(___, ___)

# Get the rmse of the cross-validation predictions
___(___, ___)
Edit and Run Code