Imputing with random forests

A machine learning approach to imputation can be both more accurate and easier to implement than traditional statistical models. First, it doesn't require you to specify the relationships between variables. Second, machine learning models such as random forests can discover highly complex, non-linear relationships and exploit them to predict missing values.

In this exercise, you will get acquainted with the missForest package, which builds a separate random forest to predict the missing values of each variable, one by one. You will call the imputing function on the biographical movies data, biopics, which you worked with earlier in the course, and then extract the filled-in data as well as the estimated imputation errors.
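To make the "one variable at a time" idea concrete, here is a minimal sketch of a single imputation step for one variable, written with the randomForest package directly. It is an illustration under assumptions, not missForest's actual code: the built-in iris data stands in for biopics, and the choice of column, the number of removed values, and the seed are all hypothetical. missForest repeats a step like this for every incomplete variable and iterates until the imputations stabilize.

```r
library(randomForest)

set.seed(42)

# Assumption: iris stands in for the course data; we remove 15 values
# from one numeric column purely for illustration
df <- iris
miss_idx <- sample(nrow(df), 15)
df$Sepal.Length[miss_idx] <- NA

# Rows where the target variable is observed
obs <- !is.na(df$Sepal.Length)

# Fit a forest on the observed rows, using all other columns as predictors
rf <- randomForest(Sepal.Length ~ ., data = df[obs, ])

# Predict the missing entries from the other columns and fill them in
predictors <- df[!obs, names(df) != "Sepal.Length"]
df$Sepal.Length[!obs] <- predict(rf, newdata = predictors)

# No missing values should remain in this column
print(sum(is.na(df$Sepal.Length)))
```

Note that no functional form was specified anywhere: the forest learns how Sepal.Length depends on the other columns on its own.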

Let's plant some random forests!

This exercise is part of the course

Handling Missing Data with Imputations in R

Exercise instructions

  • Load the missForest package.
  • Use missForest() to impute missing values in the biopics data; assign the result to imp_res.
  • Extract the imputed data set from imp_res, assign it to imp_data, and check if the number of missing values is indeed zero.
  • Extract the estimated imputation error from imp_res, assign it to imp_err, and print it to the console.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Load the missForest package
___

# Impute biopics data using missForest
imp_res <- ___(___)

# Extract imputed data and check for missing values
imp_data <- imp_res$___
print(___(___(___)))

# Extract and print imputation errors
imp_err <- imp_res$___
print(___)
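
For reference, the same workflow can be sketched on a different data set. This is a hedged example, not the exercise solution: it assumes the built-in iris data with 10% of its values removed at random by prodNA(), a helper from the missForest package, and the object names iris_mis, res, and filled are made up for illustration.

```r
library(missForest)

set.seed(1)

# Assumption: iris with 10% of entries set to NA stands in for biopics
iris_mis <- prodNA(iris, noNA = 0.1)

# Impute all variables with random forests
res <- missForest(iris_mis)

# The completed data live in the ximp element of the result
filled <- res$ximp
print(sum(is.na(filled)))

# Out-of-bag error estimates: NRMSE for continuous variables,
# PFC (proportion of falsely classified entries) for categorical ones
print(res$OOBerror)
```

Because iris mixes numeric and factor columns, the printed error should contain both an NRMSE and a PFC entry; in both cases values closer to zero indicate better imputations.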