Motivation for variable selection

1. Variable selection

In the previous chapter, you learned how to build a logistic regression model that predicts which candidate donors are most likely to donate. You built models with a fixed set of variables. In this chapter, you will learn how to wisely select a set of variables for your model.

2. Candidate predictors

So far, we have considered a basetable with only 3 candidate predictors: age, maximum gift, and income. In many applications, however, many more variables are available. Companies often store more than one thousand variables that describe their customers. In our donation-prediction application, too, many more variables can be created. It requires some creativity from the data scientist to come up with interesting candidate predictors. For instance, you could also consider the minimum gift, the mean gift and the median gift of the donor. You could add information about where the donor lives, count the number of gifts larger than a certain amount, and so on.
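As a sketch of how such candidate predictors could be derived, the example below aggregates a hypothetical gift-history table per donor; the table and its column names (donor_id, amount) are assumptions, not part of the course data.

```python
import pandas as pd

# Hypothetical gift history: one row per gift (column names are assumptions).
gifts = pd.DataFrame({
    "donor_id": [1, 1, 1, 2, 2, 3],
    "amount":   [50, 75, 100, 10, 20, 500],
})

# Derive candidate predictors per donor: minimum, mean and median gift,
# plus the number of gifts larger than a chosen threshold (here 50).
predictors = gifts.groupby("donor_id")["amount"].agg(
    min_gift="min",
    mean_gift="mean",
    median_gift="median",
    gifts_over_50=lambda s: (s > 50).sum(),
)
print(predictors)
```

The resulting table has one row per donor and can be joined to the basetable as extra candidate predictors.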

3. Variable selection: motivation

Once you have collected all the candidate predictors, you can build a logistic regression model. However, you should be careful not to add too many predictors. Indeed, there are some drawbacks to models that use many variables. First of all, models with many variables are not necessarily better models. This is due to a phenomenon called over-fitting that we discuss later in more detail. Secondly, models with many variables are harder to maintain. Often, models are used on a monthly, weekly or even daily basis. It is a good idea to limit the running time of the models, and the time needed to create the variables, by only using the significant variables. Finally, models with many variables are hard to interpret. Recall from the previous chapter that you interpreted the coefficients of the models to make sure they make sense. This is difficult if the set of variables is too large. Moreover, with many variables, chances are high that some variables are correlated, which makes interpretation even harder or impossible.

4. Model evaluation: AUC

The goal of variable selection is to select a set of variables that has optimal performance. A measure often used to quantify the performance of predictive models is the AUC value. It is a number between 0 and 1 that expresses how well the model orders the observations from low to high probability of being a target. Perfect models have an AUC of 1, whereas random models have an AUC of about 0-point-5. In Python, the AUC can easily be calculated using the function roc_auc_score from the sklearn package. It takes two arguments: the true value - 1 if the observation is a target and 0 otherwise - and the probability of being a target as calculated by the model. These arguments can be formatted as arrays from the numpy package.
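A minimal sketch of the call described above; the labels and probabilities here are made-up illustration values, not course data (in practice the probabilities would come from something like logreg.predict_proba(X)[:, 1]).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# True values: 1 if the observation is a target (a donor), 0 otherwise.
true_target = np.array([1, 0, 1, 0, 0, 1])

# Probability of being a target, as calculated by the model
# (illustrative values).
predictions = np.array([0.9, 0.3, 0.5, 0.6, 0.2, 0.8])

# AUC: 1.0 for a perfect ordering, about 0.5 for a random model.
auc = roc_auc_score(true_target, predictions)
print(round(auc, 2))  # → 0.89
```

Note that roc_auc_score expects the true labels first and the predicted probabilities second; swapping them silently gives a meaningless score.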

5. Let's practice!

Now it's your turn. Let's compute the AUC of the models constructed in the previous chapter, and compare it with that of models using other variables.