Input selection based on the AUC

1. Input selection based on the AUC

In the previous exercises, you have seen how you can construct ROC curves for each of the models you have fitted. Let's have a look again at the ROC curves for the logistic regression models.

2. ROC curves for 4 logistic regression models

We looked at four logistic regression models:

3. ROC curves for 4 logistic regression models

three models, each including the same four variables but with a different link function (logit, probit, and cloglog),

4. ROC curves for 4 logistic regression models

and one with all seven variables using a logit link. Looking at the ROC curves, it seems that, for the data we used, the link function does not have a big impact on the ROC curve when the same variables are included. The ROC curve clearly improves, however, when going from four variables to seven in the model. Seen this way, the ROC curve (or, more specifically, the AUC) could be used for variable selection here. In fact, banks are particularly interested in knowing which variables are important for predicting default. As discussed previously, and shown here for the logit model including all seven variables, you could also look at the individual p-values of the parameter estimates to see which variables are more or less important. In regression models in general, stepwise procedures that add or remove variables and evaluate their p-values are a popular way of performing variable selection. However, if you are evaluating the model based on its classification performance, it might be worth basing variable selection on the AUC rather than on p-values.
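
As a rough illustration, the four models could be fitted in R with glm() along the following lines. This is a minimal sketch: the data frame training_set and the predictor names (age, annual_inc, loan_amnt, grade, home_ownership, emp_length, int_rate) are hypothetical placeholders, not necessarily the variables used in the course.

# Sketch: four logistic regression models, assuming a data frame
# `training_set` with a binary `loan_status` column and hypothetical
# predictor names.
log_4_logit   <- glm(loan_status ~ age + annual_inc + loan_amnt + grade,
                     family = binomial(link = "logit"),   data = training_set)
log_4_probit  <- glm(loan_status ~ age + annual_inc + loan_amnt + grade,
                     family = binomial(link = "probit"),  data = training_set)
log_4_cloglog <- glm(loan_status ~ age + annual_inc + loan_amnt + grade,
                     family = binomial(link = "cloglog"), data = training_set)
log_7_logit   <- glm(loan_status ~ age + annual_inc + loan_amnt + grade +
                       home_ownership + emp_length + int_rate,
                     family = binomial(link = "logit"),   data = training_set)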

5. AUC-based pruning

This latter method is referred to as AUC-based pruning. Using this method, you start with the model that includes all variables and compute the AUC for this model.
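
A hedged sketch of this first step, assuming the hypothetical training_set and test_set data frames from before, and using the pROC package to compute the AUC:

library(pROC)  # provides roc() and auc()

# Step 1 (sketch): fit the full 7-variable logit model and compute its AUC
# on a hold-out set. Data frame and variable names are hypothetical.
fit_full <- glm(loan_status ~ age + annual_inc + loan_amnt + grade +
                  home_ownership + emp_length + int_rate,
                family = binomial(link = "logit"), data = training_set)
pred_full <- predict(fit_full, newdata = test_set, type = "response")
auc_full  <- auc(test_set$loan_status, pred_full)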

6. AUC-based pruning

In a second step, you fit all models that exclude exactly one of the variables, which leads to seven models for our data set. For each of these models, you compute the probability of default predictions as well.
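
One way this second step could look in R, continuing the sketch above (all names remain hypothetical):

# Step 2 (sketch): refit the model seven times, leaving out one variable
# each time, and store the test-set predictions for each reduced model.
predictors <- c("age", "annual_inc", "loan_amnt", "grade",
                "home_ownership", "emp_length", "int_rate")
preds <- lapply(predictors, function(v) {
  f   <- reformulate(setdiff(predictors, v), response = "loan_status")
  fit <- glm(f, family = binomial(link = "logit"), data = training_set)
  predict(fit, newdata = test_set, type = "response")
})
names(preds) <- predictors  # preds[["age"]] = predictions without age, etc.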

7. AUC-based pruning

In a third step, you compute the AUC for each of these models using these predictions, and then you retain the model with the highest AUC. In this case, this leads to the deletion of the home ownership variable from the logistic regression model. Note that by deleting this variable, the AUC goes up from 0.6512 to 0.6537. AUC-based pruning then repeats these steps on the reduced model, and continues until the AUC decreases. Note that in some cases it might even be worth allowing the AUC to decrease slightly when deleting a variable, as the less complex model can be easier to interpret. A similar method could be used for decision trees, but it is less straightforward there, since other factors come into play: for example, how a tree is pruned and which variables are chosen for a given split.
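
The third step, continuing the same sketch, might then look as follows; this is only one possible way to structure the selection, not a definitive implementation:

# Step 3 (sketch): compute the AUC of each leave-one-out model and keep
# the one with the highest AUC; repeat on the reduced predictor set until
# no deletion improves the AUC.
aucs <- sapply(preds, function(p) as.numeric(auc(test_set$loan_status, p)))
drop_var <- names(which.max(aucs))   # e.g. "home_ownership" in our example
if (max(aucs) >= as.numeric(auc_full)) {
  predictors <- setdiff(predictors, drop_var)  # prune and iterate
}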

8. Let's practice!

Now you will see for yourself how you can further simplify the logistic regression model we just discussed using AUC-based pruning!