1. Creating dummies
In the previous two chapters, you learned how to construct a basetable from scratch. This raw version of the basetable still needs some fine-tuning before you can use it to construct a predictive model. In this chapter we discuss four important topics in data preparation that help you give the finishing touch to your basetable.
2. Motivation for creating dummy variables (1)
Recall that a logistic regression model is basically a simple formula that assigns weights to the predictive variables in the model, wrapped in a logistic function (the inverse of the logit). This means that the variables should be numeric, that is, they should take real numbers as values. Categorical predictive variables like gender, country or segment therefore need to be transformed.
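To make that formula concrete, here is a minimal sketch. The intercept, the weights and the helper name `predict_probability` are made up for illustration and do not come from a fitted model.

```python
import math

# Hypothetical intercept and weights, chosen only for illustration.
intercept = -1.0
weights = {"age": 0.02, "gender_F": 0.5}

def predict_probability(row):
    """Weighted sum of the predictors, passed through the logistic function."""
    linear = intercept + sum(weights[name] * row[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-linear))

# This works because every predictor is a number; a raw value such as
# "Female" could not be multiplied by a weight.
p = predict_probability({"age": 50, "gender_F": 1})
print(round(p, 3))
```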
3. Motivation for creating dummy variables (2)
Dummy replacement is a straightforward way to do so. Instead of using the original variable, for instance gender, you add new variables, one for each value that this variable takes. For gender, this creates two dummy variables: one that is one if gender is Male and zero otherwise, and one that is one if gender is Female and zero otherwise.
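As a rough sketch of what these two dummies look like, the snippet below builds them by hand on a toy basetable. The data and the column names `gender_M` and `gender_F` are assumptions for illustration.

```python
import pandas as pd

# Toy basetable; the values are illustrative assumptions.
basetable = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# One dummy per value: 1 if the row has that value, 0 otherwise.
basetable["gender_M"] = (basetable["gender"] == "Male").astype(int)
basetable["gender_F"] = (basetable["gender"] == "Female").astype(int)

print(basetable)
```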
4. Preventing Multicollinearity (1)
Let's have a closer look at the new basetable with dummy variables. Is it really necessary to have both dummy variables in the basetable? The answer is no: if we know gender_F, we also know gender_M, so only one dummy variable is necessary. This phenomenon is called multicollinearity: one of the candidate predictive variables can be constructed from the other predictive variables. It is important to avoid multicollinearity in the basetable when using logistic regression, as it can result in unstable parameter estimates and make it hard to interpret the influence of the predictive variables on the target.
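The redundancy can be checked directly. In this small sketch on made-up data, gender_M is rebuilt exactly from gender_F, and the two columns turn out to be perfectly negatively correlated.

```python
import pandas as pd

# Toy dummy columns; the values are illustrative assumptions.
dummies = pd.DataFrame({"gender_F": [0, 1, 1, 0], "gender_M": [1, 0, 0, 1]})

# gender_M can be rebuilt exactly from gender_F, so it adds no information.
reconstructed = 1 - dummies["gender_F"]
is_redundant = bool((reconstructed == dummies["gender_M"]).all())

# Pearson correlation between the two dummy columns.
correlation = dummies["gender_F"].corr(dummies["gender_M"])

print(is_redundant, correlation)
```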
5. Preventing Multicollinearity (2)
Therefore, when creating dummy variables for a categorical variable, you should add all dummies except one.
6. Preventing Multicollinearity (3)
As another example, assume there are three countries in the basetable: USA, India and UK.
7. Preventing Multicollinearity (4)
In that case, it is best to only include two dummy variables, for instance country_USA and country_India.
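A minimal sketch of why two dummies are enough, using a made-up basetable: a row where both country_USA and country_India are zero can only be the UK.

```python
import pandas as pd

# Toy basetable; the values are illustrative assumptions.
basetable = pd.DataFrame({"country": ["USA", "India", "UK", "India"]})

basetable["country_USA"] = (basetable["country"] == "USA").astype(int)
basetable["country_India"] = (basetable["country"] == "India").astype(int)

# UK needs no dummy of its own: it is the case where both dummies are 0.
is_uk = (basetable["country_USA"] == 0) & (basetable["country_India"] == 0)

print(basetable.assign(is_uk=is_uk))
```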
8. Adding dummy variables in Python
Creating dummy variables is easy in Python with the pandas function `get_dummies`. Given a basetable with a categorical variable `segment`, you can retrieve the dummy variables by calling this function with the segment column as argument. By setting the `drop_first` argument to `True`, one of the dummy variables is removed and multicollinearity is avoided.
These dummies can then be added to the basetable using the pandas function `concat`, with the original basetable and the dummies in a list as first argument, and `axis=1` so that they are joined as columns. If you like, you can delete the original variable from the basetable.
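Put together, these steps might look like the sketch below. The toy data and the segment values are assumptions. The `prefix` and `dtype` arguments are optional extras not mentioned above, used here only to get readable column names and 0/1 integers.

```python
import pandas as pd

# Toy basetable; the columns and values are illustrative assumptions.
basetable = pd.DataFrame({
    "age": [34, 51, 27, 45],
    "segment": ["Gold", "Silver", "Bronze", "Gold"],
})

# Create the dummies; drop_first=True leaves out the first category (Bronze).
dummies_segment = pd.get_dummies(
    basetable["segment"], prefix="segment", drop_first=True, dtype=int
)

# Add the dummies to the basetable as new columns.
basetable = pd.concat([basetable, dummies_segment], axis=1)

# Optionally remove the original categorical variable.
basetable = basetable.drop(columns="segment")

print(basetable)
```

A row with zero in both segment_Gold and segment_Silver belongs to the dropped category, Bronze.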
9. Let's practice!
Now it's your turn: let's create some dummy variables.