1. Tree-based imputation
You already know how to use statistical models to impute missing values. Let's move on to discuss a different kind of model.
2. Tree-based imputation approach
Traditional statistical models can be replaced by machine learning models to predict missing values. This approach has its advantages: it doesn't assume specific relationships between variables and can pick up complex, non-linear patterns, which often leads to better predictive performance. In this course, we will use the "missForest" package, which uses "randomForest" under the hood.
3. Decision trees
Let's start with a quick intro to decision trees, the basis of random forests. Decision trees are built by repeatedly splitting the data into subsets based on the values of selected variables. These splits are chosen such that the target variable in each subset is as homogeneous as possible. To make a prediction on a new observation, we place it in the appropriate subset based on the values of its variables and predict the mean of this subset.
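To make this concrete, here is a minimal sketch using the "rpart" package and the nhanes data from the "mice" package (both assumptions, not part of this lesson's own code):

library(rpart)
library(mice)  # provides the nhanes data
# Fit a regression tree predicting bmi from the other variables;
# minsplit is lowered so splits are possible on this small data set.
tree <- rpart(bmi ~ age + hyp + chl, data = nhanes,
              control = rpart.control(minsplit = 5))
# Each observation falls into one leaf (subset) and is predicted
# with the mean bmi of that leaf.
predict(tree, newdata = nhanes[complete.cases(nhanes), ])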
4. Random forests
Random forests are collections of multiple trees. Each tree is built on data sampled with replacement from the original data (a procedure known as bagging), and only a random subset of variables is considered for splitting. The results from all trees are aggregated at the end.
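Here is a small illustrative sketch with the "randomForest" package, again assuming the nhanes data from "mice"; note that randomForest itself cannot handle missing values in the predictors, so only the complete cases are used:

library(randomForest)
library(mice)  # provides the nhanes data
nhanes_cc <- nhanes[complete.cases(nhanes), ]
rf <- randomForest(bmi ~ age + hyp + chl, data = nhanes_cc,
                   ntree = 100,   # number of bagged trees
                   mtry = 2)      # variables tried at each split
predict(rf, newdata = nhanes_cc)  # predictions averaged over all trees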
5. missForest algorithm
Now that you know how random forests work, let's go through the missForest algorithm.
First, we make an initial guess for the missing values using mean imputation.
Second, we sort the variables in ascending order by the number of missing values.
Then, for each variable, the missing values are imputed by first fitting a random forest to its observed part (using other variables as predictors) and then using it to predict the missing part. The procedure is repeated until the imputed values do not change much anymore.
This is very similar to the model-based approach which we discussed in the last two lessons.
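For intuition, here is a simplified, hand-rolled sketch of the idea for a single variable (bmi), not the actual missForest implementation, assuming "randomForest" and the nhanes data from "mice":

library(randomForest)
library(mice)  # provides the nhanes data
x <- nhanes
miss_bmi <- is.na(x$bmi)
# Step 1: initial guess via mean imputation for every incomplete column.
for (v in names(x)) {
  x[[v]][is.na(x[[v]])] <- mean(x[[v]], na.rm = TRUE)
}
# Steps 2-3 (simplified): repeatedly re-impute bmi with a forest fitted
# on the rows where bmi was originally observed; missForest instead loops
# over all variables and stops when the imputed values stabilize.
for (i in 1:5) {
  rf <- randomForest(bmi ~ age + hyp + chl, data = x[!miss_bmi, ])
  x$bmi[miss_bmi] <- predict(rf, newdata = x[miss_bmi, ])
}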
6. missForest in practice
Let's see how missForest works in practice. First, let's see how many missing values there are in each column of the nhanes data. To impute them, we load the "missForest" package and call the function, also called "missForest", on the incomplete data frame. This yields a list of imputation results, which we call "imp_res". We extract the imputed data with the "$ximp" notation. A quick check shows there are indeed no missing values in the imputed data.
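The workflow just described might look as follows, assuming nhanes comes from the "mice" package:

library(mice)        # provides the nhanes data
library(missForest)
colSums(is.na(nhanes))          # missing values per column
imp_res <- missForest(nhanes)   # list of imputation results
nhanes_imp <- imp_res$ximp      # extract the imputed data frame
sum(is.na(nhanes_imp))          # should be 0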
7. Imputation error
In order to assess the quality of imputation, we can look at the out-of-bag estimate of imputation errors based on the underlying random forests.
It is defined as the normalized root mean squared error for continuous variables and the percentage of incorrect classifications for categorical variables.
In both cases, good performance leads to a value close to 0 and values around 1 indicate a poor result. The errors can be extracted from missForest's output with "$OOBerror" notation.
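Continuing the example above, the error estimates are stored alongside the imputed data:

imp_res$OOBerror   # NRMSE for continuous variables, PFC for categorical ones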
8. Imputation error
By default, you get one error measure per variable type. You can also get per-variable errors by running missForest with the "variablewise" argument set to TRUE.
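A sketch of the per-variable version; note that, per the package documentation, continuous variables are then reported with MSE rather than NRMSE:

imp_res_vw <- missForest(nhanes, variablewise = TRUE)
imp_res_vw$OOBerror   # one error value per column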
9. Speed-accuracy trade-off
Growing multiple random forests can be time-consuming, especially for large data sets. To decrease computation time, we can sacrifice some accuracy and reduce the forest size: either by growing fewer trees (set with the "ntree" argument) or by using fewer variables for splitting (set with the "mtry" argument). The two settings affect computation time differently. The number of trees has a linear effect: halving it roughly doubles the speed. Reductions in "mtry", on the other hand, speed things up more when there are many variables in the data.
10. Speed-accuracy trade-off in practice
Let's see it in practice. First, let's time the run with default settings, i.e. 100 trees and the square root of the number of variables considered at each split. Now compare it with a run using only 10 trees and "mtry" equal to 2. Notice how the computation time went down, but the estimated error increased.
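One possible way to reproduce this comparison, again assuming the nhanes data from "mice":

library(mice)        # provides the nhanes data
library(missForest)
# Default settings: ntree = 100, mtry = floor(sqrt(number of variables)).
system.time(imp_default <- missForest(nhanes))
# Smaller forests: fewer trees and fewer candidate variables per split.
system.time(imp_fast <- missForest(nhanes, ntree = 10, mtry = 2))
imp_default$OOBerror
imp_fast$OOBerror    # typically larger: some accuracy traded for speed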
11. Let's practice!
Let's practice what you've learned!