Imputing with random forests

A machine learning approach to imputation can be both more accurate and easier to implement than traditional statistical models. First, it doesn't require you to specify the relationships between variables. Second, machine learning models such as random forests can discover highly complex, non-linear relationships and exploit them to predict missing values.

In this exercise, you will get acquainted with the missForest package, which builds a separate random forest to predict missing values for each variable, one by one. You will call the imputing function on the biographical movies data, biopics, which you have worked with earlier in the course, and then extract the filled-in data as well as the estimated imputation errors.

Let's plant some random forests!

Load the missForest package.
Use missForest() to impute missing values in the biopics data; assign the result to imp_res.
Extract the imputed data set from imp_res, assign it to imp_data, and check if the number of missing values is indeed zero.
Extract the estimated imputation error from imp_res, assign it to imp_err, and print it to the console.
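
The steps above can be sketched as follows. This is a minimal illustration, not the course's reference solution: because the biopics data frame is only available in the course workspace, the sketch uses the built-in iris data with 10% of its values removed at random (via missForest's prodNA() helper) as a stand-in. The name dat and the seed value are assumptions made for this example; in the exercise you would pass biopics to missForest() instead.

```r
# Load the missForest package
library(missForest)

# Stand-in for biopics (assumption): iris with 10% of values set to NA
set.seed(42)
dat <- prodNA(iris, noNA = 0.1)

# Impute missing values; one random forest is fitted per variable
imp_res <- missForest(dat)

# Extract the imputed data set and count the remaining missing values
# (the count is expected to be zero after imputation)
imp_data <- imp_res$ximp
print(sum(is.na(imp_data)))

# Extract and print the estimated out-of-bag imputation error
imp_err <- imp_res$OOBerror
print(imp_err)
```

For data that mixes numeric and categorical columns, the error vector should contain one entry for continuous variables (NRMSE) and one for categorical variables (PFC, the proportion of falsely classified entries); lower values indicate better imputation.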
