1. Deciding on the number of variables
Forward stepwise variable selection returns the variables in the order in which they increase the accuracy, but you still need to decide how many variables to use.
2. Evaluating the AUC
To do so, you can look at the AUC values. The order of the variables is given in the list variables_forward. You start with an empty list of variables, add the variables in variables_forward one by one, and each time calculate the AUC value.
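This incremental evaluation can be sketched as follows. The basetable, the column names, and the order in variables_forward are illustrative stand-ins, not the course's actual data; the loop structure is the point.

```python
# Sketch of evaluating the AUC for growing variable lists.
# The synthetic basetable and the variables_forward order are assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 500
basetable = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "gifts": rng.poisson(3, n),
    "income": rng.normal(50, 15, n),
})
# Synthetic target that depends on age and gifts (income is noise)
logits = 0.05 * (basetable["age"] - 40) + 0.3 * (basetable["gifts"] - 3)
basetable["target"] = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Order returned by forward stepwise selection (illustrative)
variables_forward = ["age", "gifts", "income"]

auc_values = []
variables_evaluate = []
for variable in variables_forward:
    # Add the next variable and refit the model
    variables_evaluate.append(variable)
    model = LogisticRegression(max_iter=1000)
    model.fit(basetable[variables_evaluate], basetable["target"])
    predictions = model.predict_proba(basetable[variables_evaluate])[:, 1]
    auc_values.append(roc_auc_score(basetable["target"], predictions))

print(auc_values)
```

Each entry in auc_values is the AUC of the model built on the first one, two, three, ... variables of variables_forward.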
3. Evaluating the AUC
If you plot the AUC values, you obtain a curve that typically keeps increasing.
4. Over-fitting
However, if you used new data to evaluate the subsequent models, the AUC would not keep increasing. Instead, it would even decrease after a while. This phenomenon is called over-fitting: by adding more variables to the model, the accuracy on the data on which the model is built increases, but the true performance of the model decreases, because the complex model does not generalize to other data.
5. Detecting over-fitting
There exist smart techniques to detect and therefore prevent over-fitting. Partitioning splits the data into two sets, train and test. The model is built on the train set and evaluated on the test set. Because the test data is independent of the model, the performance on the test data is representative of the true performance of the model.
6. Partitioning
One way to partition data is by randomly dividing it into two parts. However, when the target incidence is low, it is better to make sure that the target incidence is the same in the train and test sets, that is, to stratify on the target.
In Python this can be done using the train_test_split function in the sklearn model_selection module. You first create basetables X and Y, containing the predictive variables and the target respectively. Then, you apply the train_test_split function using these X and Y as arguments. The stratify argument indicates that the target incidence in Y should be the same in the train and test data. The test_size argument tells how large the test set should be: if it is 0.4, the test data contains 40% of all observations and the train data the remaining 60%.
The train_test_split function returns four values: a train and a test set for both the predictive variables in X and the target in Y. At the end, you can concatenate the predictive variables with the target to obtain the final train and test basetables.
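The steps above can be sketched as follows; the toy basetable and its column names are made-up examples, but the train_test_split call mirrors the description.

```python
# Minimal sketch of stratified partitioning with scikit-learn.
# The basetable below is a toy example with a 10% target incidence.
import pandas as pd
from sklearn.model_selection import train_test_split

basetable = pd.DataFrame({
    "age": range(100),
    "income": [i * 100 for i in range(100)],
    "target": [1 if i % 10 == 0 else 0 for i in range(100)],
})

# X: predictive variables, Y: target
X = basetable.drop("target", axis=1)
Y = basetable["target"]

# stratify=Y keeps the target incidence equal in train and test;
# test_size=0.4 puts 40% of the observations in the test set
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.4, stratify=Y, random_state=1
)

# Concatenate predictive variables and target into the final basetables
train = pd.concat([X_train, Y_train], axis=1)
test = pd.concat([X_test, Y_test], axis=1)

print(len(train), len(test), Y_train.mean(), Y_test.mean())
```

With 100 observations and a 10% incidence, both the train and test sets end up with exactly 10% positives.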
7. Deciding the cut-off
You can now plot the AUC curves of the subsequent models on both the train and test data. You can clearly see that the train AUC keeps increasing, but that the test AUC stabilizes and then decreases.
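A plot like this can be produced with matplotlib; the AUC lists below are illustrative values shaped like the curves described, not computed results.

```python
# Sketch of plotting train and test AUC against the number of variables.
# The AUC values are made-up illustrations of the typical pattern.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, for scripts
import matplotlib.pyplot as plt

train_aucs = [0.60, 0.66, 0.71, 0.74, 0.76, 0.78, 0.79]  # keeps increasing
test_aucs = [0.59, 0.65, 0.69, 0.71, 0.71, 0.70, 0.68]   # stabilizes, drops
number_of_variables = range(1, len(train_aucs) + 1)

plt.plot(number_of_variables, train_aucs, label="train AUC")
plt.plot(number_of_variables, test_aucs, label="test AUC")
plt.xlabel("Number of variables")
plt.ylabel("AUC")
plt.legend()
plt.savefig("auc_curves.png")
```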
8. Deciding the cut-off
When deciding on how many variables to keep in the model, one should take into account that the test AUC should be as high as possible and that the model should have as few variables as possible. In this case, it is clear that the cut-off indicated by the dashed line is the best option, as all models using more variables have the same or a lower test AUC.
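That decision rule can be expressed in a few lines: pick the smallest model whose test AUC equals the maximum. The test AUC values here are made-up examples.

```python
# Sketch of choosing the cut-off: the smallest number of variables
# for which no larger model has a strictly higher test AUC.
# The test_aucs list is an illustrative example.
test_aucs = [0.61, 0.68, 0.72, 0.74, 0.74, 0.73, 0.71]

best_auc = max(test_aucs)
# index() returns the first (smallest) model reaching the maximum
number_of_variables = test_aucs.index(best_auc) + 1
print(number_of_variables)  # → 4
```

Here the fourth and fifth models tie on test AUC, so the simpler four-variable model is chosen.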
9. Let's practice!
Let's practice these concepts in Python.