Evaluate a modeling procedure using n-fold cross-validation
In this exercise, you will use splitPlan, the 3-fold cross-validation plan from the previous exercise, to make predictions from a model that predicts mpg$cty from mpg$hwy.
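splitPlan itself was built in the previous exercise (in the course, via the vtreat package's kWayCrossValidation()). To show the shape such a plan has, here is a hand-rolled sketch in base R; make_split_plan is a hypothetical stand-in for illustration, not the course function:

```r
# Sketch of the structure a k-way cross-validation plan takes:
# a list of k splits, each a list with $train (row indices to fit on)
# and $app (row indices to predict on).
# make_split_plan is a hypothetical helper, not vtreat's implementation.
make_split_plan <- function(nRows, k) {
  fold <- sample(rep_len(1:k, nRows))  # assign each row to one of k folds
  lapply(1:k, function(i) {
    list(train = which(fold != i),     # rows outside fold i
         app   = which(fold == i))     # rows inside fold i
  })
}

splitPlan_demo <- make_split_plan(nRows = 10, k = 3)
str(splitPlan_demo[[1]])  # a list with $train and $app index vectors
```

Each row appears in exactly one $app set across the k splits, so every row receives exactly one out-of-sample prediction.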
If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:
# Initialize a column of the appropriate length
dframe$pred.cv <- 0

# k is the number of folds
# splitPlan is the cross-validation plan
# fmla is the model formula (here, cty ~ hwy)
for(i in 1:k) {
  # Get the ith split
  split <- splitPlan[[i]]
  # Build a model on the training data
  # from this split (lm, in this case)
  model <- lm(fmla, data = dframe[split$train, ])
  # Make predictions on the
  # application data from this split
  dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app, ])
}
Cross-validation predicts how well a model built from all the data will perform on new data. As with the test/train split, for a good modeling procedure, cross-validation performance and training performance should be close.
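To see this comparison in action without giving away the exercise answer, here is a self-contained sketch on the built-in cars dataset (predicting dist from speed) rather than mpg; the fold assignment is a simple base-R stand-in for splitPlan:

```r
# Compare training RMSE vs cross-validation RMSE on the built-in cars data.
# This is an illustrative sketch; the exercise itself uses mpg and splitPlan.
set.seed(1)
k <- 3
fold <- sample(rep_len(1:k, nrow(cars)))  # assign each row to a fold

# Cross-validation predictions: each row predicted by a model
# that never saw it during fitting
cars$pred.cv <- 0
for (i in 1:k) {
  model <- lm(dist ~ speed, data = cars[fold != i, ])
  cars$pred.cv[fold == i] <- predict(model, newdata = cars[fold == i, ])
}

# Full-model (training) predictions
cars$pred <- predict(lm(dist ~ speed, data = cars), newdata = cars)

rmse <- function(pred, y) sqrt(mean((pred - y)^2))
rmse(cars$pred, cars$dist)     # training RMSE
rmse(cars$pred.cv, cars$dist)  # cross-validation RMSE, typically a bit larger
```

For a well-behaved modeling procedure the two RMSE values land close together, which is exactly the check the exercise asks you to perform on mpg.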
The data frame mpg, the cross-validation plan splitPlan, and the rmse() function have been pre-loaded.
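The pre-loaded rmse() is supplied by the course environment. As a reference point, a minimal sketch of what such a helper might look like (an assumption, not necessarily the course's exact definition):

```r
# Root mean squared error: sketch of a helper like the pre-loaded rmse().
# Takes the predicted values and the actual outcome, in that order.
rmse <- function(predcol, ycol) {
  res <- predcol - ycol   # residuals
  sqrt(mean(res^2))       # root of the mean squared residual
}
```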
This exercise is part of the course Supervised Learning in R: Regression.
Exercise instructions
- Run the 3-fold cross-validation plan from splitPlan and put the predictions in the column mpg$pred.cv. Use lm() and the formula cty ~ hwy.
- Create a linear regression model on all the mpg data (formula cty ~ hwy) and assign the predictions to mpg$pred.
- Use rmse() to get the root mean squared error of the predictions from the full model (mpg$pred). Recall that rmse() takes two arguments: the predicted values and the actual outcome.
- Get the root mean squared error of the cross-validation predictions. Are the two values about the same?
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# mpg is available
summary(mpg)
# splitPlan is available
str(splitPlan)
# Run the 3-fold cross validation plan from splitPlan
k <- ___ # Number of folds
mpg$pred.cv <- 0
for(i in ___) {
  split <- ___
  model <- lm(___, data = ___)
  mpg$pred.cv[___] <- predict(___, newdata = ___)
}
# Predict from a full model
mpg$pred <- ___(___(cty ~ hwy, data = mpg))
# Get the rmse of the full model's predictions
___(___, ___)
# Get the rmse of the cross-validation predictions
___(___, ___)