1. Discretization of continuous variables
The first step of creating predictor insight graphs is to discretize the continuous variables.
2. Discretization in python
In Python, you can easily discretize pandas columns using the `qcut` function. Assume that you want to divide the variable maximum gift into three bins of equal size. You can then call `qcut` with two arguments. The first argument is the column containing the variable that you want to discretize, and the second argument is the number of bins you want to obtain. The output of `qcut` is a new column that can be added to the basetable.
You can check that `qcut` divides the variable into bins of equal size by grouping the basetable by the new column with `groupby` and counting the rows in each bin with `size`.
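As a minimal sketch of these two steps, the example below builds a small made-up basetable. The column names `max_gift` and `disc_max_gift` and the random data are assumptions for illustration only, not part of the original course material.

```python
import numpy as np
import pandas as pd

# Hypothetical basetable with one continuous column (made-up data)
rng = np.random.default_rng(0)
basetable = pd.DataFrame({"max_gift": rng.uniform(1, 500, size=300)})

# Divide max_gift into three bins that hold the same number of rows
basetable["disc_max_gift"] = pd.qcut(basetable["max_gift"], 3)

# Count the rows per bin to check that the bins have equal size
bin_sizes = basetable.groupby("disc_max_gift", observed=False).size()
print(bin_sizes)
```

The `observed=False` argument only tells `groupby` to report every bin, including empty ones. It is passed explicitly here to avoid a warning in recent pandas versions.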
3. Which variables should be discretized
In many cases, there are many variables to discretize. Instead of discretizing the continuous variables one by one, you can write a procedure that does the job for you. First you need to determine which variables are continuous, and then you can loop over these continuous variables and discretize them.
Assume that the variables in the model are given in a list `variables_model`. Variables like `income_average` and `gender_M` only take two values, so it does not make sense to discretize these variables. You can write a function `check_discretize` that determines whether a variable should be discretized or not. This function counts the number of distinct values in a variable, and returns True if this number is higher than a preset threshold and False otherwise.
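One possible way to write such a function is sketched below. The argument names, the use of `nunique` to count distinct values, and the sample data are assumptions, since the transcript only describes what the function should do.

```python
import pandas as pd

def check_discretize(basetable, variable, threshold):
    """Return True if the variable has more distinct values than threshold."""
    # nunique() counts the distinct non-missing values in the column
    return bool(basetable[variable].nunique() > threshold)

# Hypothetical basetable: one two-valued variable and one with many values
basetable = pd.DataFrame({
    "gender_M": [0, 1, 1, 0, 1, 0],
    "age": [25, 38, 41, 56, 62, 70],
})

print(check_discretize(basetable, "gender_M", 5))  # False: only 2 distinct values
print(check_discretize(basetable, "age", 5))       # True: 6 distinct values
```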
4. Discretization of all variables
With this function, you can loop through all the variables and discretize only those that need it. For each variable in the model, you check whether it should be discretized. If so, you create a new variable with the same name and the prefix disc, which holds the discretized version of the original variable.
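The loop could look roughly like the sketch below. The sample data, the threshold of 5, the number of bins, and the exact prefix spelling `disc_` are assumptions for illustration. `check_discretize` is repeated so that the example runs on its own.

```python
import numpy as np
import pandas as pd

def check_discretize(basetable, variable, threshold):
    # True if the variable has more distinct values than the threshold
    return bool(basetable[variable].nunique() > threshold)

# Hypothetical basetable with one binary and two many-valued variables
rng = np.random.default_rng(1)
basetable = pd.DataFrame({
    "gender_M": rng.integers(0, 2, size=200),
    "age": rng.integers(20, 90, size=200),
    "max_gift": rng.uniform(1, 500, size=200),
})

variables_model = ["gender_M", "age", "max_gift"]
threshold = 5     # assumed cut-off for "enough distinct values"
number_bins = 5   # assumed number of bins per variable

for variable in variables_model:
    if check_discretize(basetable, variable, threshold):
        # Store the discretized version under the original name with a prefix
        new_variable = "disc_" + variable
        basetable[new_variable] = pd.qcut(basetable[variable], number_bins)

print(basetable.columns.tolist())
```

Here `gender_M` is skipped because it has only two distinct values, while `age` and `max_gift` each get a discretized counterpart.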
5. Clean cuts
The `qcut` function divides the variable into bins of equal size. However, this sometimes results in ugly intervals. Consider for instance the variable age that is discretized in 5 bins. The first bin starts at 38 and ends at 49, but it would be clearer if it started at 40 and ended at 50, and the same holds for the other bins. In Python, you can specify the cuts that you like using the `cut` function. This function takes the column that you want to discretize as first argument, and the cutting points as second argument. You could for instance bin the age variable using the cleaner cuts 30, 40, 50 and 60. The resulting discretized variable does not have bins of equal size, but the result is much easier to read and interpret.
A good strategy is to first use `qcut` to get a rough idea of how the bins should be chosen, and then clean up the cuts using the `cut` function.
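This two-step strategy can be sketched as follows. The ages and the column names are made up for illustration. The cutting points 30, 40, 50 and 60 are the ones mentioned above.

```python
import pandas as pd

# Hypothetical basetable with a handful of ages (made-up data)
basetable = pd.DataFrame({"age": [31, 38, 42, 49, 50, 55, 58, 60]})

# Step 1: use qcut to get a rough idea of where the bin edges fall
rough_bins = pd.qcut(basetable["age"], 3)
print(rough_bins.cat.categories)

# Step 2: use cut with hand-picked, cleaner cutting points
basetable["disc_age"] = pd.cut(basetable["age"], [30, 40, 50, 60])

# The bins no longer hold the same number of rows, but the edges are clean
clean_sizes = basetable.groupby("disc_age", observed=False).size()
print(clean_sizes)
```

By default, `cut` builds intervals that are open on the left and closed on the right, so an age of 50 lands in the bin from 40 to 50. Values outside the outermost cutting points become missing, so the cuts should cover the full range of the variable.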
6. Let's practice!
Let's practice discretization!