Logistic regression imputation
A popular choice for imputing binary variables is logistic regression. Unfortunately, there is no function similar to impute_lm() that would do it. That's why you'll write such a function yourself!
Let's call the function impute_logreg(). Its first argument will be a data frame df in which the missing values of all other variables have already been initialized, so that the only remaining missing values are in the column to be imputed. The second argument will be a formula for the logistic regression model.
The function will do the following:
- Keep the locations of missing values.
- Build the model.
- Make predictions.
- Replace missing values with predictions.
Don't worry about the line creating imp_var: this is just a way to extract the name of the column to impute from the formula. Let's do some functional programming!
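To make the imp_var line less mysterious, here is a minimal sketch of what it does. The formula and its variable names (is_sick, age, weight) are hypothetical and chosen only for illustration; all.vars() is shown as an alternative that is not used in the exercise.

```r
# Hypothetical formula; the variable names are made up for this illustration
f <- is_sick ~ age + weight

# The second element of a formula is its left-hand side (the response),
# so converting it to character yields the name of the column to impute
imp_var <- as.character(f[2])
print(imp_var)  # "is_sick"

# Alternative (not used in the exercise): the first variable in the formula
alt_imp_var <- all.vars(f)[1]
```

Both approaches return the same name here because the response is a bare column name.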
This exercise is part of the course
Handling Missing Data with Imputations in R
Exercise instructions
- Create a boolean mask for where df[imp_var] is missing and assign it to missing_imp_var.
- Fit a logistic regression model using the formula and data that the function will get as arguments, while remembering to set the correct family to ensure a logistic regression is fit (pass it without quotation marks); assign the model to logreg_model.
- Predict the response with the model and assign it to preds; remember to set the appropriate prediction type.
- Use preds alongside missing_imp_var to impute missing values.
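The roles of the family and the prediction type can be seen in a small sketch outside the exercise. It assumes the built-in mtcars data, whose 0/1 column am stands in for a binary variable; the object names model, log_odds, probs and classes are illustrative only.

```r
# mtcars ships with R; `am` is a 0/1 column used here purely as a stand-in
# family = binomial (no quotation marks) makes glm() fit a logistic regression
model <- glm(am ~ wt, data = mtcars, family = binomial)

# The default prediction type is "link": values on the log-odds scale
log_odds <- predict(model)

# type = "response" returns predicted probabilities between 0 and 1
probs <- predict(model, type = "response")

# Threshold the probabilities at 0.5 to obtain 0/1 class labels
classes <- ifelse(probs >= 0.5, 1, 0)
```

Because the exercise thresholds the predictions at 0.5, the predictions need to be on the probability scale rather than the log-odds scale.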
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
impute_logreg <- function(df, formula) {
  # Extract name of response variable
  imp_var <- as.character(formula[2])
  # Save locations where the response is missing
  missing_imp_var <- ___
  # Fit logistic regression model
  logreg_model <- ___(___, data = ___, family = ___)
  # Predict the response and convert it to 0s and 1s
  preds <- predict(___, type = ___)
  preds <- ifelse(preds >= 0.5, 1, 0)
  # Impute missing values with predictions
  df[missing_imp_var, imp_var] <- ___[___]
  return(df)
}
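One possible way to fill in the blanks is sketched below; it is not the course's official solution. Two choices are assumptions that go beyond the template: the mask is built with df[[imp_var]] so that it is a plain logical vector, and predict() is given newdata = df so that it returns one prediction per row, including rows that glm() drops because their response is NA. The toy data (a copy of mtcars with some am values blanked out) is hypothetical.

```r
# A possible completion of the template (a sketch, not the official solution)
impute_logreg <- function(df, formula) {
  # Extract name of response variable
  imp_var <- as.character(formula[2])
  # Logical vector: TRUE where the response is missing
  missing_imp_var <- is.na(df[[imp_var]])
  # family = binomial makes glm() fit a logistic regression
  logreg_model <- glm(formula, data = df, family = binomial)
  # Assumption: newdata = df is not in the template; it yields a prediction
  # for every row, including rows excluded from fitting due to an NA response
  preds <- predict(logreg_model, newdata = df, type = "response")
  preds <- ifelse(preds >= 0.5, 1, 0)
  # Impute missing values with predictions
  df[missing_imp_var, imp_var] <- preds[missing_imp_var]
  return(df)
}

# Hypothetical toy data: mtcars with three values of `am` set to NA
toy <- mtcars[, c("am", "wt")]
toy$am[c(1, 10, 20)] <- NA
toy_imp <- impute_logreg(toy, am ~ wt)
```

After the call, toy_imp$am contains no missing values, and the rows that were observed in toy keep their original values.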