Cross-validation
Cross-validation is a method for testing a predictive model on unseen data. In cross-validation, the value of a penalty (loss) function, here the mean prediction error, is computed on data that were not used to fit the model; a low value indicates good predictive performance.
Cross-validation gives a good estimate of the actual predictive power of the model. It can also be used to compare different models or classification methods.
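The idea can be illustrated with a simple train/test split: fit a model on one part of the data and evaluate the mean prediction error on the part it has not seen. The sketch below uses simulated toy data (not the course data) and base R's glm; names like toy, fit and pred are illustrative only.
# simulate toy data for illustration (not the alc data used in the exercise)
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(toy$x))
# split into a training half and a testing half
train <- toy[1:100, ]
test <- toy[101:200, ]
# fit a logistic regression on the training half only
fit <- glm(y ~ x, data = train, family = "binomial")
# predicted probabilities for the unseen testing half
pred <- predict(fit, newdata = test, type = "response")
# mean prediction error: share of predictions on the wrong side of 0.5
mean(abs(test$y - pred) > 0.5)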
This exercise is part of the course Helsinki Open Data Science.
Exercise instructions
- Define the loss function loss_func and compute the mean prediction error for the training data: the high_use column in alc is the target and the probability column has the predictions.
- Perform leave-one-out cross-validation and print out the mean prediction error for the testing data. (nrow(alc) gives the observation count in alc, and using K = nrow(alc) defines the leave-one-out method. The cv.glm function from the 'boot' library computes the error and stores it in delta. See ?cv.glm for more information.)
- Adjust the code: perform 10-fold cross-validation and print out the mean prediction error for the testing data. Is the prediction error higher or lower on the testing data compared to the training data? Why? (A possible adjustment is sketched after the sample code below.)
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# the logistic regression model m and dataset alc (with predictions) are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  # TRUE when the predicted probability is on the wrong side of 0.5
  n_wrong <- abs(class - prob) > 0.5
  # proportion of wrong predictions = mean prediction error
  mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
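# (sketch, not part of the original scaffold) one way to complete this step,
# assuming alc$high_use holds the observed classes and alc$probability the predictions
loss_func(class = alc$high_use, prob = alc$probability)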
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
# average number of wrong predictions in the cross validation
cv$delta[1]
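A possible way to complete the last instruction, assuming the same model m, data alc and loss_func defined above: change the K argument of cv.glm from nrow(alc) to 10 to switch from leave-one-out to 10-fold cross-validation.
# 10-fold cross-validation (sketch): only the K argument changes
cv_10 <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# mean prediction error on the testing folds
cv_10$delta[1]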