Forward stepwise variable selection

1. Forward stepwise variable selection

In this section, you will learn a very intuitive approach to variable selection, namely a forward stepwise procedure.

2. The forward stepwise variable selection procedure

As its name implies, the method proceeds in several steps. First, it selects, among all candidate predictors, the variable that gives the best AUC when used on its own in a model. Next, it selects the candidate predictor that gives the best AUC in combination with the first selected variable. This scheme continues until you have added all candidate predictors, or until you reach a predefined number of variables.

3. Functions in Python

Because the forward stepwise procedure is not trivial and repeats a lot of code, it is most efficient to use Python functions to implement it. In short, a Python function is a block of code that can easily be re-used. A function block starts with the keyword `def`, followed by the name of the function, parentheses and a colon. Between the parentheses, arguments separated by commas can be provided, serving as input for the function. The function block consists of one or more lines of code, typically ending with a return statement that gives the output of the function. The function can then be called by writing its name followed by the argument values between parentheses.
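As a minimal illustration of these parts, the sketch below defines and calls a small function. The name `add_variable` and its arguments are hypothetical, chosen only for this example.

```python
# "def", the function name, the arguments between parentheses, and a colon.
def add_variable(current_variables, new_variable):
    """Return a new list: current_variables extended with new_variable."""
    extended = current_variables + [new_variable]
    return extended  # the return statement gives the output of the function


# Calling the function: fill in the arguments between parentheses.
extended = add_variable(["age", "gender_F"], "max_gift")
print(extended)  # ['age', 'gender_F', 'max_gift']
```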

4. Implementation of the forward stepwise procedure

Implementing the forward stepwise procedure goes smoothly if you split the task into smaller chunks. First, you implement a function that returns the AUC of a model built with a given variable set. Second, you code a function that returns the next best variable to add. Finally, you use these functions to add variables in a stepwise manner until you reach the desired variable set.

5. Implementation of the AUC function

In the previous exercises you learned how to calculate the AUC of a model. All you need to do is wrap this code in a function that takes the variables, the target and the basetable as arguments. This function defines the X and y variables that serve as input to the logistic regression fit function. Next, it makes predictions, and finally it calculates and returns the AUC. Now, if you want to know the AUC of a model using the variables age and gender_F, you can simply run the AUC function with these variables as input.
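One way such a function might look is sketched below, assuming scikit-learn and pandas. The basetable here is randomly generated purely so the example runs; it is not the course data. The `max_iter` setting and passing the target as a single column name are also choices of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def auc(variables, target, basetable):
    """Fit a logistic regression on the given variables and return its AUC."""
    X = basetable[variables]   # predictors: a list of column names
    y = basetable[target]      # target: a single column name
    logreg = LogisticRegression(max_iter=1000)
    logreg.fit(X, y)
    # Probability of the positive class for every row.
    predictions = logreg.predict_proba(X)[:, 1]
    return roc_auc_score(y, predictions)


# Synthetic stand-in basetable (an assumption of this sketch).
rng = np.random.default_rng(0)
n = 500
basetable = pd.DataFrame({
    "age": rng.normal(50, 10, n),
    "gender_F": rng.integers(0, 2, n),
})
basetable["target"] = (basetable["age"] + rng.normal(0, 10, n) > 50).astype(int)

auc_current = auc(["age", "gender_F"], "target", basetable)
print(round(auc_current, 2))
```

Note that this sketch computes the AUC on the same rows the model was fitted on, as the description above does.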

6. Calculating the next best variable

The next_best function needs to know which variables are currently in the model, the candidate variables, the target and the basetable. The function loops through all candidate variables and keeps track of which variable is best, and of the AUC associated with this best variable. Therefore, you first need to initialize the best AUC and the best variable. For each variable in the candidate variable set, you calculate the AUC of a model whose variables are those already in the model, extended with the variable that you want to evaluate. If this AUC is better than the best AUC found so far, you update the best AUC and the best variable. If you want to know which variable among mean_gift, max_gift and min_gift should be added next, given that age and gender_F are already in the model, you can call the next_best function with age and gender_F as the current variables and the three gift variables as the candidates.
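The loop described above can be sketched as follows. To keep the block self-contained, the scoring function is passed in as an extra `score` argument, which is an addition of this sketch; in practice you would pass the AUC function. The `toy_score` function and its values are made up solely so the example runs.

```python
def next_best(current_variables, candidate_variables, target, basetable, score):
    """Return the candidate that gives the highest score when added to the model."""
    best_score = -1        # lower than any possible AUC
    best_variable = None
    for variable in candidate_variables:
        # Score the current variables extended with this one candidate.
        score_v = score(current_variables + [variable], target, basetable)
        if score_v > best_score:
            best_score = score_v
            best_variable = variable
    return best_variable


# Made-up scores, NOT real results: they only stand in for a fitted model.
toy_values = {"mean_gift": 0.70, "max_gift": 0.74, "min_gift": 0.68}


def toy_score(variables, target, basetable):
    # Looks up the made-up value of the most recently added variable.
    return toy_values[variables[-1]]


chosen = next_best(["age", "gender_F"],
                   ["mean_gift", "max_gift", "min_gift"],
                   target="target", basetable=None, score=toy_score)
print(chosen)  # max_gift, the candidate with the highest made-up value
```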

7. The forward stepwise variable selection procedure

To complete the implementation of the forward stepwise procedure, you first initialize the candidate variable list and the current variable list, which keeps track of the variables added to the model so far. You can also set the maximum number of variables that can be added. In each iteration, the next best variable is determined using the next_best function. The current_variables list is updated by adding the chosen variable, and the chosen variable is removed from the candidate variable list.
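Put together, the loop might look like the sketch below. The wrapper name `forward_selection`, the `score` argument and the made-up `toy_gain` values are assumptions of this example. The `next_best` shown here is a compact variant that uses `max` instead of an explicit loop.

```python
def next_best(current_variables, candidate_variables, target, basetable, score):
    # Compact variant: pick the candidate whose addition scores highest.
    return max(candidate_variables,
               key=lambda v: score(current_variables + [v], target, basetable))


def forward_selection(candidate_variables, target, basetable, score,
                      max_number_variables):
    """Add variables one at a time until the maximum number is reached."""
    candidates = list(candidate_variables)  # copy, so the input is not modified
    current_variables = []
    number_iterations = min(max_number_variables, len(candidates))
    for _ in range(number_iterations):
        chosen = next_best(current_variables, candidates, target, basetable, score)
        current_variables = current_variables + [chosen]  # add to the model
        candidates.remove(chosen)                         # no longer a candidate
    return current_variables


# Made-up contributions, NOT real results; in practice pass the AUC function.
toy_gain = {"age": 0.04, "gender_F": 0.01, "max_gift": 0.10, "min_gift": 0.02}


def toy_score(variables, target, basetable):
    return 0.5 + sum(toy_gain[v] for v in variables)


selected = forward_selection(list(toy_gain), "target", None, toy_score, 3)
print(selected)  # ['max_gift', 'age', 'min_gift']
```

The order of the returned list reflects the order in which variables were added, so its first elements form the selected set for any smaller number of variables.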

8. Let's practice!

Now it's your turn: let's see if you can obtain a good set of variables.