1. Tree-based imputation
You already know how to use statistical models to impute missing values. Let's move on to discuss a different kind of model.
2. Tree-based imputation approach
Traditional statistical models can be replaced by machine learning models to predict missing values. This approach has its advantages: it doesn't assume specific relationships between variables and can pick up complex, non-linear patterns, which often leads to better predictive performance. In this course, we will use the "missForest" package, which uses "randomForest" under the hood.
3. Decision trees
Let's start with a quick intro to decision trees, the basis of random forests. Decision trees are built by repeatedly splitting the data into subsets based on the values of selected variables. These splits are chosen such that the target variable in each subset is as homogeneous as possible. To make a prediction on a new observation, we place it in the appropriate subset based on the values of its variables and predict the mean of this subset.
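To make this concrete, here is a minimal sketch using the "rpart" package and the nhanes data from the "mice" package (both assumptions, not part of this lesson's own code):

library(rpart)
library(mice)  # provides the nhanes data
# Fit a regression tree predicting bmi from the other variables;
# minsplit is lowered so splits are possible on this small data set.
tree <- rpart(bmi ~ age + hyp + chl, data = nhanes,
              control = rpart.control(minsplit = 5))
# Each observation falls into one leaf (subset) and is predicted
# with the mean bmi of that leaf.
predict(tree, newdata = nhanes[complete.cases(nhanes), ])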
4. Random forests
Random forests are collections of multiple trees. Each tree is built on data sampled with replacement from the original data (a procedure known as bagging), and only a random subset of variables is considered for splitting. The results from all trees are aggregated at the end.
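Here is a small illustrative sketch with the "randomForest" package, again assuming the nhanes data from "mice"; note that randomForest itself cannot handle missing values in the predictors, so only the complete cases are used:

library(randomForest)
library(mice)  # provides the nhanes data
nhanes_cc <- nhanes[complete.cases(nhanes), ]
rf <- randomForest(bmi ~ age + hyp + chl, data = nhanes_cc,
                   ntree = 100,   # number of bagged trees
                   mtry = 2)      # variables tried at each split
predict(rf, newdata = nhanes_cc)  # predictions averaged over all trees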
5. missForest algorithm
Now that you know how random forests work, let's go through the missForest algorithm.
First, we make an initial guess for the missing values using mean imputation.
Second, we sort the variables in ascending order by the number of missing values.
Then, for each variable, the missing values are imputed by first fitting a random forest to its observed part (using other variables as predictors) and then using it to predict the missing part. The procedure is repeated until the imputed values do not change much anymore.
This is very similar to the model-based approach which we discussed in the last two lessons.
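For intuition, here is a simplified, hand-rolled sketch of the idea for a single variable (bmi), not the actual missForest implementation, assuming "randomForest" and the nhanes data from "mice":

library(randomForest)
library(mice)  # provides the nhanes data
x <- nhanes
miss_bmi <- is.na(x$bmi)
# Step 1: initial guess via mean imputation for every incomplete column.
for (v in names(x)) {
  x[[v]][is.na(x[[v]])] <- mean(x[[v]], na.rm = TRUE)
}
# Steps 2-3 (simplified): repeatedly re-impute bmi with a forest fitted
# on the rows where bmi was originally observed; missForest instead loops
# over all variables and stops when the imputed values stabilize.
for (i in 1:5) {
  rf <- randomForest(bmi ~ age + hyp + chl, data = x[!miss_bmi, ])
  x$bmi[miss_bmi] <- predict(rf, newdata = x[miss_bmi, ])
}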
6. missForest in practice
Let's see how missForest works in practice. First, let's see how many missing values there are in each column of the nhanes data. To impute them, we load the "missForest" package and call the function, also called "missForest", on the incomplete data frame. This yields a list of imputation results, which we call "imp_res". We extract the imputed data with the "$ximp" notation. A quick check shows there are indeed no missing values in the imputed data.
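The workflow just described might look as follows, assuming nhanes comes from the "mice" package:

library(mice)        # provides the nhanes data
library(missForest)
colSums(is.na(nhanes))          # missing values per column
imp_res <- missForest(nhanes)   # list of imputation results
nhanes_imp <- imp_res$ximp      # extract the imputed data frame
sum(is.na(nhanes_imp))          # should be 0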
7. Imputation error
In order to assess the quality of imputation, we can look at the out-of-bag estimate of imputation errors based on the underlying random forests.
It is defined as the normalized root mean squared error for continuous variables and the percentage of incorrect classifications for categorical variables.
In both cases, good performance leads to a value close to 0 and values around 1 indicate a poor result. The errors can be extracted from missForest's output with "$OOBerror" notation.
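Continuing the example above, the error estimates are stored alongside the imputed data:

imp_res$OOBerror   # NRMSE for continuous variables, PFC for categorical ones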
8. Imputation error
By default, you get one error measure per variable type. You can also get per-variable errors by running missForest with the "variablewise" argument set to TRUE.
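A sketch of the per-variable version; note that, per the package documentation, continuous variables are then reported with MSE rather than NRMSE:

imp_res_vw <- missForest(nhanes, variablewise = TRUE)
imp_res_vw$OOBerror   # one error value per column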
9. Speed-accuracy trade-off
Growing multiple random forests can be time-consuming, especially for large data sets. To decrease computation time, we can sacrifice some accuracy and reduce the forest size: either by growing fewer trees (set with the "ntree" argument) or by using fewer variables for splitting (set with the "mtry" argument). The two settings affect computation time differently. The number of trees has a linear effect: halving it roughly doubles the speed. Reductions in "mtry", on the other hand, speed things up more when there are many variables in the data.
10. Speed-accuracy trade-off in practice
Let's see it in practice. First, let's time the run with default settings, i.e. 100 trees and the square root of the number of variables considered at each split. Now compare it with a run using only 10 trees and "mtry" equal to 2. Notice how the computation time went down, but the estimated error increased.
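One possible way to reproduce this comparison, again assuming the nhanes data from "mice":

library(mice)        # provides the nhanes data
library(missForest)
# Default settings: ntree = 100, mtry = floor(sqrt(number of variables)).
system.time(imp_default <- missForest(nhanes))
# Smaller forests: fewer trees and fewer candidate variables per split.
system.time(imp_fast <- missForest(nhanes, ntree = 10, mtry = 2))
imp_default$OOBerror
imp_fast$OOBerror    # typically larger: some accuracy traded for speed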
11. Let's practice!
Let's practice what you've learned!