Cross-validation
Cross-validation is a method for testing a predictive model on unseen data. In cross-validation, the value of a penalty (loss) function, here the mean prediction error, is computed on data that were not used to fit the model; a low value indicates good predictive performance.
Cross-validation gives a good estimate of the actual predictive power of the model. It can also be used to compare different models or classification methods.
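The idea can be illustrated with a simple train/test split: fit a model on one part of the data and evaluate the mean prediction error on the part it has not seen. The sketch below uses simulated toy data (not the course data) and base R's glm; names like toy, fit and pred are illustrative only.
# simulate toy data for illustration (not the alc data used in the exercise)
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(toy$x))
# split into a training half and a testing half
train <- toy[1:100, ]
test <- toy[101:200, ]
# fit a logistic regression on the training half only
fit <- glm(y ~ x, data = train, family = "binomial")
# predicted probabilities for the unseen testing half
pred <- predict(fit, newdata = test, type = "response")
# mean prediction error: share of predictions on the wrong side of 0.5
mean(abs(test$y - pred) > 0.5)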
This exercise is part of the course Helsinki Open Data Science.
Exercise instructions
- Define the loss function loss_func and compute the mean prediction error for the training data: the high_use column in alc is the target and the probability column has the predictions.
- Perform leave-one-out cross-validation and print out the mean prediction error for the testing data. (nrow(alc) gives the observation count in alc, and using K = nrow(alc) defines the leave-one-out method. The cv.glm function from the 'boot' library computes the error and stores it in delta. See ?cv.glm for more information.)
- Adjust the code: perform 10-fold cross-validation and print out the mean prediction error for the testing data. Is the prediction error higher or lower on the testing data compared to the training data? Why? (A possible adjustment is sketched after the sample code below.)
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# the logistic regression model m and dataset alc (with predictions) are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  # TRUE when the predicted probability is on the wrong side of 0.5
  n_wrong <- abs(class - prob) > 0.5
  # proportion of wrong predictions = mean prediction error
  mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
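# (sketch, not part of the original scaffold) one way to complete this step,
# assuming alc$high_use holds the observed classes and alc$probability the predictions
loss_func(class = alc$high_use, prob = alc$probability)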
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
# average number of wrong predictions in the cross validation
cv$delta[1]
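A possible way to complete the last instruction, assuming the same model m, data alc and loss_func defined above: change the K argument of cv.glm from nrow(alc) to 10 to switch from leave-one-out to 10-fold cross-validation.
# 10-fold cross-validation (sketch): only the K argument changes
cv_10 <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# mean prediction error on the testing folds
cv_10$delta[1]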