Logistic regression imputation

A popular choice for imputing binary variables is logistic regression. Unfortunately, there is no function similar to impute_lm() that would do it. That's why you'll write such a function yourself!

Let's call the function impute_logreg(). Its first argument will be a data frame df in which missing values have already been initialized, so that only the column to be imputed still contains missing values. The second argument will be a formula for the logistic regression model.

The function will do the following:

  • Keep the locations of missing values.
  • Build the model.
  • Make predictions.
  • Replace missing values with predictions.

Don't worry about the line creating imp_var; it is just a way to extract the name of the column to impute from the formula. Let's do some functional programming!
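For readers curious about the imp_var line, the following small sketch shows what it does. The formula and its variable names (y, x1, x2) are placeholders invented for illustration. Indexing a two-sided formula with [2] picks out its left-hand side, and as.character() turns that into a string.

```r
# Hypothetical formula; y, x1 and x2 are placeholder names
f <- y ~ x1 + x2

# Element 2 of a two-sided formula is its left-hand side (the response)
imp_var <- as.character(f[2])
imp_var

# For comparison, all.vars() lists every variable in the formula,
# with the response first
all.vars(f)
```

Note that this extraction assumes the response is a plain column name; a transformed response such as log(y) would not yield a usable column name this way.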

This exercise is part of the course

Handling Missing Data with Imputations in R

Exercise instructions

  • Create a boolean mask for where df[imp_var] is missing and assign it to missing_imp_var.
  • Fit a logistic regression model using the formula and data passed to the function as arguments. Set the family that makes the model a logistic regression (pass it without quotation marks), and assign the model to logreg_model.
  • Predict the response with the model and assign it to preds; remember to set the appropriate prediction type.
  • Use preds alongside missing_imp_var to impute missing values.
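The prediction type matters because predict() on a fitted glm returns values on the link scale (log-odds) by default, which cannot be compared against the 0.5 threshold directly. The sketch below uses toy data invented for illustration, not the course dataset, to show the difference between the two scales.

```r
# Toy data invented for illustration
toy <- data.frame(
  x = c(-3, -2, -1, -0.5, 0.5, 1, 2, 3),
  y = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# family = binomial makes glm() fit a logistic regression
fit <- glm(y ~ x, data = toy, family = binomial)

# Default type is "link": predictions are log-odds (any real number)
log_odds <- predict(fit)

# type = "response" returns probabilities between 0 and 1
probs <- predict(fit, type = "response")

# The logistic function maps one scale onto the other
all.equal(plogis(log_odds), probs)
```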

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

impute_logreg <- function(df, formula) {
  # Extract name of response variable
  imp_var <- as.character(formula[2])
  # Save locations where the response is missing
  missing_imp_var <- ___
  # Fit logistic regression model
  logreg_model <- ___(___, data = ___, family = ___)
  # Predict the response and convert it to 0s and 1s
  preds <- predict(___, type = ___)
  preds <- ifelse(preds >= 0.5, 1, 0)
  # Impute missing values with predictions
  df[missing_imp_var, imp_var] <- ___[___]
  return(df)
}
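
One possible way to fill in the blanks is sketched below. It is an illustrative completion, not necessarily the course's reference solution, and it makes two assumptions of its own. First, the mask is built with is.na(df[[imp_var]]) so that it is a plain logical vector. Second, newdata = df is passed to predict(), which goes beyond the template: glm() drops rows with a missing response when fitting, so without newdata the predictions would cover only the observed rows and would not line up with the mask.

```r
impute_logreg <- function(df, formula) {
  # Extract name of response variable
  imp_var <- as.character(formula[2])
  # Save locations where the response is missing (plain logical vector)
  missing_imp_var <- is.na(df[[imp_var]])
  # Fit logistic regression model; rows with a missing response are dropped
  logreg_model <- glm(formula, data = df, family = binomial)
  # Predict a probability for every row of df (assumption: newdata = df),
  # then convert the probabilities to 0s and 1s
  preds <- predict(logreg_model, newdata = df, type = "response")
  preds <- ifelse(preds >= 0.5, 1, 0)
  # Impute missing values with predictions
  df[missing_imp_var, imp_var] <- preds[missing_imp_var]
  return(df)
}

# Usage on toy data invented for illustration: the last two rows
# have a missing response and complete predictors
toy <- data.frame(
  x = c(-3, -2, -1, -0.5, 0.5, 1, 2, 3, -2.5, 2.5),
  y = c(0, 0, 0, 1, 0, 1, 1, 1, NA, NA)
)
toy_imputed <- impute_logreg(toy, y ~ x)
toy_imputed
```

The sketch relies on the predictors being complete, which matches the exercise's premise that only the column being imputed contains missing values.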