1. Dummy variables, missing data, and interactions
All of the predictors used in a regression analysis must be numeric. This means that all categorical data must be represented numerically. Missing data also poses a problem, as an empty value cannot be used to make predictions.
In this video, you will learn tips for preparing these types of data to be used in a logistic regression model. You will also learn how to model the interactions among predictors, an important step in building more powerful predictive models.
2. Dummy coding categorical data
In chapter 1, you learned about dummy coding, which creates a set of binary (one-zero) variables that represent each category except one that serves as the reference group.
Dummy coding is the most common method for handling categorical data in logistic regression. The glm function will automatically dummy code any factor-type variables used in the model. Simply apply the factor function to the data as in the example here.
Keep in mind that you may run into a case where a categorical feature is represented with numbers, such as 1, 2, 3 for 'hot', 'warm', and 'cold'. Even in this case, it is advisable to convert the feature to a factor. This allows each category to have a unique impact on the outcome.
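The idea can be sketched as follows. This is a minimal example with a small, made-up data frame (the `donated` and `temp` columns are hypothetical, not from the course data); the point is that converting numeric codes to a factor lets glm dummy code them automatically:

```r
# Hypothetical donor data: 'temp' is stored as numbers but is categorical
donors <- data.frame(
  donated = c(1, 0, 1, 0, 1, 0),
  temp    = c(1, 2, 3, 1, 2, 3)   # 1 = hot, 2 = warm, 3 = cold
)

# Convert the numeric codes to a factor so glm() treats them as categories
donors$temp <- factor(donors$temp, levels = c(1, 2, 3),
                      labels = c("hot", "warm", "cold"))

# glm() creates binary indicators for "warm" and "cold";
# "hot" (the first level) serves as the reference group
model <- glm(donated ~ temp, data = donors, family = "binomial")
```

After fitting, `coef(model)` shows terms for `tempwarm` and `tempcold` but none for `temphot`, since the reference category is absorbed into the intercept.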
3. Imputing missing data
By default, the regression model will exclude any observation with a missing value on one of its predictors. This may not be a big deal for small amounts of missing data, but it can quickly become a much larger problem as missingness grows.
With categorical missing data, a missing value can be treated like any other category. You might construct categories for male, female, other, and missing.
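A quick sketch of this idea in R, using a small hypothetical vector (the `gender` values here are invented for illustration): the addNA function keeps NA as an explicit factor level, which can then be relabeled as its own "missing" category.

```r
# Hypothetical categorical data with missing values
gender <- c("male", "female", NA, "other", NA)

# addNA() keeps NA as an explicit factor level...
gender <- addNA(factor(gender))

# ...which can then be relabeled as a "missing" category
levels(gender)[is.na(levels(gender))] <- "missing"
```

The resulting factor now treats formerly missing observations like any other category, so they are retained by the model rather than dropped.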
When a numeric value is missing, the solution is less clear.
One potential solution uses a technique called imputation. This fills, or imputes, the missing value with a guess about what the value may be. A very simple strategy is called mean imputation, which, as you might expect, imputes the average.
Because records with missing data may differ systematically from those without, a binary 1/0 missing value indicator can be added to model the fact that a value was imputed. Sometimes, this becomes one of the model's most important predictors!
It is important to note that although this strategy is OK for simple predictive models, it is not appropriate for every regression application. More sophisticated forms of imputation use models to predict the missing data based on the non-missing values.
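The two steps above, mean imputation plus a missing value indicator, can be sketched like this (the `age` vector is hypothetical example data):

```r
# Hypothetical numeric predictor with missing values
age <- c(23, 41, NA, 35, NA, 52)

# Record which values were missing BEFORE imputing (1 = imputed)
missing_age <- ifelse(is.na(age), 1, 0)

# Mean imputation: replace each NA with the mean of the observed values
age[is.na(age)] <- mean(age, na.rm = TRUE)
```

Both `age` and `missing_age` would then be supplied to the model as predictors, so the model can adjust for the fact that some ages were guessed rather than observed.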
4. Interaction effects
An interaction effect considers the fact that two predictors, when combined, may have a different impact on an outcome than the sum of their separate individual impacts. Their combination may strengthen, weaken, or completely eliminate the impact of the individual predictors.
For example, obesity and smoking are both known to be harmful to one's health, but put together they may be even more harmful. Alternatively, two predictors may be harmful when applied separately, but when combined they neutralize, suppress, or nullify each other.
Being able to model these combinations is important for creating the best predictive models. As illustrated here, the R formula interface uses the multiplication symbol to create an interaction between two predictors. The resulting model will include terms for each of the individual components as well as the combined effect.
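As a sketch of the formula syntax, here is an interaction between two hypothetical predictors, `recency` and `frequency`, on simulated data (the variable names and data are illustrative, not the course dataset):

```r
# Simulated data: does the effect of recency depend on frequency?
set.seed(123)
donors <- data.frame(
  donated   = rbinom(100, 1, 0.5),
  recency   = rnorm(100),
  frequency = rnorm(100)
)

# recency * frequency expands to:
#   recency + frequency + recency:frequency
model <- glm(donated ~ recency * frequency,
             data = donors, family = "binomial")
```

The fitted model contains four coefficients: the intercept, each individual predictor, and the combined `recency:frequency` term that captures the interaction.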
5. Let's practice!
In the next series of exercises, you will apply dummy coding, missing value imputation, and interaction effects to build a stronger donation model.