
Assessing out-of-sample model fit

You now know that it makes more sense to look at the out-of-sample model fit than the in-sample fit. In this exercise, you therefore want to come up with an out-of-sample accuracy measure.

First, though, you need to take a few preparatory steps. Use defaultData again. logitModelNew is already loaded in your environment.

Keep in mind that a complete analysis would also (and especially) compare the different model candidates on out-of-sample data.

The in-sample accuracy, using the optimal threshold of 0.3, is 0.7922901. Make sure you understand whether there is overfitting.
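
To make the overfitting check concrete, here is a minimal sketch on simulated data. The data frame sim, its columns, and the helper acc are made up for illustration; they are not the course's defaultData or logitModelNew. The idea is to compute accuracy at the same threshold on the rows used for fitting and on held-out rows, and then compare the two values.

```r
# Simulated stand-in data (an assumption, not the course's defaultData)
set.seed(1)
n <- 600
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-1 + 1.2 * sim$x1))

# Hold out one third of the rows
idx <- sample(n, size = 400)
sim_train <- sim[idx, ]
sim_test <- sim[-idx, ]

fit <- glm(y ~ x1 + x2, family = binomial, data = sim_train)

# Hypothetical helper: share of correct classifications at a threshold
acc <- function(model, data, threshold = 0.3) {
  p <- predict(model, newdata = data, type = "response")
  mean(as.integer(p > threshold) == data$y)
}

acc_in <- acc(fit, sim_train)   # in-sample accuracy
acc_out <- acc(fit, sim_test)   # out-of-sample accuracy
c(in_sample = acc_in, out_of_sample = acc_out)
```

If the out-of-sample value is clearly lower than the in-sample value, that points to overfitting; if the two are close, the model generalizes about as well as it fits.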

This exercise is part of the course

Machine Learning for Marketing Analytics in R


Exercise instructions

  • First, split the dataset randomly into a training set and a test set. The training set should contain about 2/3 of the overall data.

  • Then fit the model on the training set and call it logitTrainNew. Use the given formula, formulaLogit.

  • Make predictions on the test set and then calculate the out-of-sample accuracy with the help of a confusion matrix. Note that SDMTools is no longer available for download from CRAN. On your own computer, install it with remotes::install_version("SDMTools", "1.1-221.2") instead.

  • Compare the out-of-sample accuracy to the in-sample value, given above.
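
If you cannot install SDMTools, a confusion matrix and the accuracy can also be computed in base R. The following is a sketch only: confusion_base is a hypothetical helper, not an SDMTools function, and it assumes that predictions strictly above the threshold are classified as 1. The obs and pred vectors are made-up toy values.

```r
# Hypothetical base-R helper (not part of SDMTools)
confusion_base <- function(obs, pred, threshold = 0.5) {
  # Fix the factor levels so the table is always 2 x 2,
  # even if one class never occurs
  pred_class <- factor(as.integer(pred > threshold), levels = c(0, 1))
  obs_class <- factor(obs, levels = c(0, 1))
  table(pred = pred_class, obs = obs_class)
}

# Toy observed outcomes and predicted probabilities
obs <- c(0, 0, 1, 1, 1, 0)
pred <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.05)

cm <- confusion_base(obs, pred, threshold = 0.3)
cm

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

The accuracy line has the same shape as the one in the exercise code: the diagonal of the matrix holds the correct classifications.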

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Split data in train and test set
set.seed(534381) 
defaultData$isTrain <- rbinom(nrow(defaultData), 1, 0.66)
train <- subset(defaultData, ___ == 1)
test <- subset(defaultData, ___ == 0)

logitTrainNew <- glm(formulaLogit, family = binomial, data = ___) # Modeling
test$predNew <- predict(logitTrainNew, type = "response", newdata = ___) # Predictions

# Out-of-sample confusion matrix and accuracy
confMatrixModelNew <- confusion.matrix(___, ___, threshold = 0.3) 
sum(diag(confMatrixModelNew)) / sum(confMatrixModelNew) # Compare this value to the in-sample accuracy
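
Note that the split above draws each row's assignment independently with rbinom, so the training set holds only roughly 2/3 of the rows and its exact size varies with the seed. If you need an exact 2/3 split, one possible alternative is to sample row indices. This is a sketch on a made-up data frame toy that stands in for defaultData; the names toy, train_exact, and test_exact are assumptions.

```r
set.seed(534381)

# Made-up stand-in for defaultData
toy <- data.frame(id = 1:300, x = rnorm(300))

# Draw exactly two thirds of the row indices without replacement
n_train <- floor(2 * nrow(toy) / 3)
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

train_exact <- toy[train_idx, ]
test_exact <- toy[-train_idx, ]

# The two sets are disjoint and together cover all rows
c(train = nrow(train_exact), test = nrow(test_exact))
```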