Assessing out-of-sample model fit
You now know that it makes more sense to look at the out-of-sample model fit than the in-sample fit. In this exercise, you therefore want to come up with an out-of-sample accuracy measure.
Before that, you need a few preparatory steps. Use defaultData again. logitModelNew is already loaded in your environment.
Keep in mind that a complete analysis would always compare the candidate models on out-of-sample data as well, and especially there.
The in-sample accuracy, using the optimal threshold of 0.3, is 0.7922901.
Make sure you understand whether there is overfitting: an out-of-sample accuracy clearly below the in-sample value would point to it.
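To make the overfitting check concrete, here is a minimal sketch that compares in-sample and out-of-sample accuracy. It runs on simulated data, because defaultData and formulaLogit are not reproduced here. The data frame simData, the predictors x1 and x2, and the helper accuracyAt are hypothetical names introduced for illustration only, and the choice to classify a case as positive when its predicted probability is at or above the threshold is an assumption.

```r
# Simulated stand-in for defaultData (hypothetical, for illustration only)
set.seed(1)
n <- 3000
simData <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
simData$y <- rbinom(n, 1, plogis(-1 + 1.2 * simData$x1 - 0.8 * simData$x2))

# Hypothetical helper: share of correct classifications at a given threshold
# (assumes "probability >= threshold" means predicted class 1)
accuracyAt <- function(actual, predictedProb, threshold = 0.3) {
  mean((predictedProb >= threshold) == (actual == 1))
}

# Random split with roughly 2/3 of the rows in the training set
isTrain <- rbinom(n, 1, 0.66) == 1
trainSim <- simData[isTrain, ]
testSim <- simData[!isTrain, ]

# Fit on the training rows only
fit <- glm(y ~ x1 + x2, family = binomial, data = trainSim)

# In-sample: predictions for the rows the model was fitted on
accIn <- accuracyAt(trainSim$y, predict(fit, type = "response"))
# Out-of-sample: predictions for rows the model has never seen
accOut <- accuracyAt(testSim$y, predict(fit, newdata = testSim, type = "response"))

c(inSample = accIn, outOfSample = accOut)
```

If the two values are close, the model generalizes about as well as it fits. A much lower out-of-sample value would suggest overfitting.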
This exercise is part of the course
Machine Learning for Marketing Analytics in R
Exercise instructions
First, split the dataset randomly into a training set and a test set. The training set should contain 2/3 of the overall data.
Then, quickly run the model and call it logitTrainNew. Use the given formula.
Make predictions on the test set and then calculate the out-of-sample accuracy with the help of a confusion matrix. Note that SDMTools can no longer be downloaded from CRAN. On your personal computer, install it instead via remotes::install_version("SDMTools", "1.1-221.2").
Compare the out-of-sample accuracy to the in-sample value given above.
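A Bernoulli draw per row, as in the sample code for this exercise, puts roughly 2/3 of the rows into the training set, and the exact share varies with the seed. If an exact 2/3 split is wanted, one possible alternative is to sample row indices without replacement. The sketch below is not part of the exercise. The row count n is a hypothetical placeholder for nrow(defaultData).

```r
set.seed(534381)

# Hypothetical number of rows; in the exercise this would be nrow(defaultData)
n <- 3000

# Draw exactly two thirds of the row indices without replacement
trainIdx <- sample(seq_len(n), size = floor(2 * n / 3))

# Indicator in the same 0/1 form as the isTrain column used in the exercise
isTrain <- as.integer(seq_len(n) %in% trainIdx)

table(isTrain)
```

The Bernoulli version is shorter and perfectly adequate for large data sets. The index-based version only matters when the split size has to be exact.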
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Split data in train and test set
set.seed(534381)
defaultData$isTrain <- rbinom(nrow(defaultData), 1, 0.66)
train <- subset(defaultData, ___ == 1)
test <- subset(defaultData, ___ == 0)
logitTrainNew <- glm(formulaLogit, family = binomial, data = ___) # Modeling
test$predNew <- predict(logitTrainNew, type = "response", newdata = ___) # Predictions
# Out-of-sample confusion matrix and accuracy
confMatrixModelNew <- confusion.matrix(___, ___, threshold = 0.3)
sum(diag(confMatrixModelNew)) / sum(confMatrixModelNew) # Compare this value to the in-sample accuracy
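If SDMTools is not available, the same accuracy calculation can be sketched in base R with table(). The vectors obs and pred below are made-up illustration values, not results from defaultData, and treating a probability at or above the threshold as class 1 is an assumption rather than a statement about how confusion.matrix behaves.

```r
# Hypothetical observed outcomes and predicted probabilities
obs <- c(0, 0, 1, 1, 0, 1, 0, 0)
pred <- c(0.10, 0.35, 0.80, 0.25, 0.05, 0.60, 0.40, 0.20)
threshold <- 0.3

# Turn probabilities into classes; fixed factor levels keep the table 2 x 2
# even if one class never occurs
predClass <- factor(as.integer(pred >= threshold), levels = c(0, 1))
obsClass <- factor(obs, levels = c(0, 1))

# Rows: predicted class, columns: observed class
confMat <- table(pred = predClass, obs = obsClass)

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(confMat)) / sum(confMat)
accuracy
```

The diagonal holds the true negatives and true positives, so the last expression mirrors the sum(diag(...)) / sum(...) line in the exercise code.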