1. Out-of-sample validation and cross validation
Welcome back! Let's look at out-of-sample validation and cross validation in order to avoid overfitting. We will start with out-of-sample validation.
2. Out-of-sample fit: training and test data
First, randomly split your data into two parts. One should comprise about two thirds of the original dataset. This is the training set on which the model is specified.
The other part comprises the remaining third and serves as the test set, on which the goodness-of-fit measures are calculated.
When working with random numbers, we set a seed at the beginning of the analysis in order to ensure reproducibility.
Then, we partition the data using a vector of 0s and 1s generated by the `rbinom()` function from the `stats` package. The 1s will indicate observations for the training set; since that set should contain about two thirds of the observations, I set the `prob` argument to 0.66.
Now we use this newly created vector in order to draw subsets of the original dataset. Observations with 1s in the random vector become observations for the training set, observations with 0s are assigned to the test set.
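As a minimal sketch, assuming the original data lives in a data frame called `churnData` (a placeholder name, as is the response column `churn` used later), the split could look like this:

```r
set.seed(534381)                           # ensure reproducibility

# 1 = training observation (with probability 0.66), 0 = test observation
trainIndicator <- rbinom(nrow(churnData), size = 1, prob = 0.66)

train <- churnData[trainIndicator == 1, ]  # about two thirds of the data
test  <- churnData[trainIndicator == 0, ]  # about one third of the data
```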
3. Out-of-sample fit: building model
We specify the model `logitTrainNew` on the training set only. Then we hand the fitted model to the `predict()` function in order to make predictions. By setting the argument `type = "response"`, we predict the probabilities of a person churning. In order to predict on the out-of-sample data, we set `newdata = test`. This way the coefficients estimated on the training set are used for predictions on the new test dataset.
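In code, this could look as follows, assuming a response column named `churn` and using all remaining columns as predictors (the actual model formula in the lesson may differ):

```r
# Fit the logistic regression on the training data only
logitTrainNew <- glm(churn ~ ., family = binomial, data = train)

# Predicted churn probabilities for the unseen test observations
test$predNew <- predict(logitTrainNew, type = "response", newdata = test)
```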
4. Out-of-sample accuracy
Let's focus on the out-of-sample accuracy, although we could use any goodness-of-fit measure for out-of-sample validation. We use the out-of-sample predictions `test$predNew` to calculate the confusion matrix just like before. I redid the thresholding procedure using only the training data; 0.3 is still optimal!
The out-of-sample accuracy is calculated by dividing the number of observations on the main diagonal of the matrix by the total number of observations. At 79.5%, it is practically identical to the in-sample measure which we calculated on the training set. This shows that the model does not overfit the training data.
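Put together, with the hypothetical response column `churn` (assumed to be coded 0/1), the confusion matrix and the out-of-sample accuracy could be computed like this:

```r
# Confusion matrix at the threshold of 0.3 chosen on the training data
confMatNew <- table(test$churn, test$predNew > 0.3)

# Accuracy: observations on the main diagonal over all observations
sum(diag(confMatNew)) / sum(confMatNew)   # about 0.795 in the lesson's data
```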
5. Cross-validation: setup
Cross-validation is an even better tool for guarding against overfitting, since it uses the available data more efficiently than a single out-of-sample split: every observation is used for both training and testing.
This graphic shows a 4-fold cross-validation procedure:
First, split your dataset randomly into 4 subsets. Three of these subsets are used as training data; the remaining one, called the test data, is used to calculate the model's goodness-of-fit measures. Then you repeat this three more times, each time with a different subset as test data and the remaining 3 subsets as training data. Finally, you calculate the average of the goodness-of-fit measures across the four runs, as sketched below.
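Here is a hand-rolled sketch of that procedure, again with the placeholder names `churnData` and `churn` (response assumed coded 0/1); the `cv.glm()` function shown next automates all of this:

```r
set.seed(534381)
# Randomly assign each observation to one of 4 folds
folds <- sample(rep(1:4, length.out = nrow(churnData)))

accuracies <- sapply(1:4, function(k) {
  trainFold <- churnData[folds != k, ]    # 3 subsets for training
  testFold  <- churnData[folds == k, ]    # 1 subset for testing
  fit  <- glm(churn ~ ., family = binomial, data = trainFold)
  pred <- predict(fit, newdata = testFold, type = "response")
  mean((pred > 0.3) == testFold$churn)    # accuracy at the 0.3 threshold
})

mean(accuracies)                          # average across the 4 folds
```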
6. Cross-validation: accuracy
The `cv.glm()` function from the `boot` package allows you to implement cross-validation for generalized linear models. Its `cost` argument is a function of the observed responses and the predicted responses, and it could be any goodness-of-fit measure. Since we focus on the accuracy here, we first have to specify a function that returns the accuracy of classifications based on a threshold of 0.3.
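A minimal sketch, keeping the placeholder names `churnData` and `churn`, and using `K = 4` to match the graphic (the lesson may use a different number of folds):

```r
library(boot)

# Cost function: accuracy at the 0.3 threshold
# (r = observed 0/1 responses, pi = predicted probabilities)
costAcc <- function(r, pi) mean((pi > 0.3) == r)

# Fit on the full data; cv.glm() refits this model within each fold
logitModel <- glm(churn ~ ., family = binomial, data = churnData)

set.seed(534381)
cv.glm(churnData, logitModel, cost = costAcc, K = 4)$delta[1]  # CV accuracy
```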
Here again, the accuracy is pretty much the same as the in-sample estimate.
Usually, analyses do not stop here. One way to continue is to change the model slightly, for example by adding or removing variables, compute the cross-validated model fit statistics, and use them to compare the respective models.
7. Learnings and relevance
Good job! You made it through the whole chapter! Check out what you learned.
8. Last exercise!
Ok, get ready for the last exercise about logistic regression!