Model-based imputation approach

1. Model-based imputation approach

Welcome to chapter 3. In this chapter, you will learn how to employ statistical and machine learning models for imputing missing data.

2. Model-based imputation

The idea behind model-based imputation is to impute each variable with a different statistical model. This is a great approach when we have some prior knowledge about the relations between the variables and can take it into account when building the models.

3. Model-based imputation procedure

The general idea is to loop over the variables and, for each of them, build a model that explains it using the remaining variables. This model is used to predict the missing values. We then iterate through the variables multiple times, imputing the locations where the data were originally missing. Let's see what this means with an example.

4. Model-based imputation step by step

Imagine we have a data frame with four variables: A, B, C and D, two of which (A and C) have missing values that we would like to impute.

5. Model-based imputation step by step

First, we build a model that predicts missing values in A using B, C and D. You can see the NAs in A replaced with imputed numbers in the table.

6. Model-based imputation step by step

Then, we treat the imputed values in A as though they were observed data and we build a model that predicts missing values in C using A, B and D.

7. Model-based imputation step by step

Now that we have looped over both incomplete variables, we start again. We set the values in A that were originally missing back to NA and again build a model to predict them. This model will now use the values we have just imputed in C. We repeat this process a couple of times. Each time, the newly imputed values become more and more similar to those from the previous iteration. At some point they stop changing, indicating that the process has converged and we can stop. To sum it all up: we mark the locations where data are originally missing and repeatedly impute them one variable at a time, using the imputed data in the other variables.
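The mark-initialize-iterate pattern described above can be sketched in base R. This is only an illustration, not the course's code: the toy data frame, the mean initialization and the choice of five iterations are all assumptions made for the example.

```r
set.seed(42)
n <- 100
dat <- data.frame(B = rnorm(n), D = rnorm(n))
dat$A <- 2 * dat$B + rnorm(n)
dat$C <- dat$A - dat$D + rnorm(n)
dat$A[sample(n, 20)] <- NA
dat$C[sample(n, 20)] <- NA

# Mark the locations where data are originally missing
missing_A <- is.na(dat$A)
missing_C <- is.na(dat$C)

# Initialize the missing values (here: crudely, with the column means)
dat$A[missing_A] <- mean(dat$A, na.rm = TRUE)
dat$C[missing_C] <- mean(dat$C, na.rm = TRUE)

# Repeatedly re-impute one variable at a time using the others
for (i in 1:5) {
  dat$A[missing_A] <- NA
  fit_A <- lm(A ~ B + C + D, data = dat)   # lm drops the NA rows when fitting
  dat$A[missing_A] <- predict(fit_A, newdata = dat[missing_A, ])

  dat$C[missing_C] <- NA
  fit_C <- lm(C ~ A + B + D, data = dat)
  dat$C[missing_C] <- predict(fit_C, newdata = dat[missing_C, ])
}
```

After the loop, both A and C are fully imputed.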

8. How to choose the model

Now, how do we choose a model for each of the variables? This depends on the type of the variable to be imputed. For continuous variables, a popular choice is linear regression. For binary variables, it's logistic regression. For categorical variables, multinomial logistic regression is used, and for counts, Poisson regression is a good choice. Let's take a closer look at how to impute continuous variables with linear regression.
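In R, the first, second and fourth of these models are available out of the box via lm and glm; a quick sketch on made-up data (multinomial regression for categorical variables needs an extra package, e.g. nnet::multinom, and is not shown):

```r
set.seed(1)
df <- data.frame(x = rnorm(50))
df$cont <- 2 * df$x + rnorm(50)            # continuous outcome
df$bin  <- rbinom(50, 1, plogis(df$x))     # binary outcome
df$cnt  <- rpois(50, exp(0.5 * df$x))      # count outcome

m_cont <- lm(cont ~ x, data = df)                     # linear regression
m_bin  <- glm(bin ~ x, data = df, family = binomial)  # logistic regression
m_cnt  <- glm(cnt ~ x, data = df, family = poisson)   # Poisson regression
```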

9. Single linear regression imputation

We load the package called "simputation". It has a function, "impute_lm", that performs linear regression imputation according to the formula we pass. The left-hand side of the formula lists the variables to be imputed, separated by pluses, while the right-hand side lists the variables to be used as predictors in the model. A dot stands for all remaining variables in the data frame. Here, we impute Height and Weight using all other variables. Let's check if they were indeed imputed. It turns out there are still some missing values remaining! This is because a linear regression model cannot predict cases where at least one of the predictors is missing, which could be any of the incomplete variables. To fix this, we will have to initialize the missing values somehow. Also, a single imputation is usually not enough: it is based on the naive initialized values and could be biased. A proper approach is to iterate over the variables multiple times, as we discussed before. Let's see how to do it.
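The reason a single pass can leave NAs behind is easy to demonstrate: a linear model cannot produce a prediction for a row whose predictor is itself missing. A small base-R illustration (the data here are made up for the example):

```r
set.seed(7)
dat <- data.frame(
  Height = rnorm(30, 170, 10),
  Weight = rnorm(30, 70, 8)
)
dat$Height[c(1, 2)] <- NA
dat$Weight[c(2, 3)] <- NA   # row 2 is missing both variables

fit    <- lm(Height ~ Weight, data = dat)
miss_h <- is.na(dat$Height)
dat$Height[miss_h] <- predict(fit, newdata = dat[miss_h, ])

# Row 1 (Weight observed) gets an imputed Height,
# but row 2 (Weight also missing) stays NA
```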

10. Linear regression imputation in practice

First, we initialize the missing values with hot-deck imputation. We save the locations of the missing values of height and weight using the Boolean indicators created by the hotdeck function, which you saw in the previous chapter. Then, we iterate over the variables 5 times. In each iteration, we set height to NA where it was originally missing and impute it with the impute_lm function, using age, gender and weight as predictors. Then we do the same for weight, including the freshly imputed height among the predictors.
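A self-contained sketch of this recipe follows. It is an approximation, not the slide's code: the data are simulated, a random-donor draw stands in for VIM::hotdeck, and plain lm plus predict stand in for simputation::impute_lm; only the column names and predictor sets come from the narration.

```r
set.seed(123)
n <- 80
dat <- data.frame(
  age    = sample(20:60, n, replace = TRUE),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
dat$height <- 150 + 0.3 * dat$age + (dat$gender == "M") * 10 + rnorm(n, 0, 5)
dat$weight <- -40 + 0.6 * dat$height + rnorm(n, 0, 4)
dat$height[sample(n, 15)] <- NA
dat$weight[sample(n, 15)] <- NA

# These indicators play the role of hotdeck()'s *_imp Boolean columns
height_imp <- is.na(dat$height)
weight_imp <- is.na(dat$weight)

# Crude stand-in for hot-deck initialization: draw donors at random
dat$height[height_imp] <- sample(dat$height[!height_imp], sum(height_imp), replace = TRUE)
dat$weight[weight_imp] <- sample(dat$weight[!weight_imp], sum(weight_imp), replace = TRUE)

for (i in 1:5) {
  dat$height[height_imp] <- NA              # reset originally missing values
  fit_h <- lm(height ~ age + gender + weight, data = dat)
  dat$height[height_imp] <- predict(fit_h, newdata = dat[height_imp, ])

  dat$weight[weight_imp] <- NA
  fit_w <- lm(weight ~ age + gender + height, data = dat)  # uses fresh height
  dat$weight[weight_imp] <- predict(fit_w, newdata = dat[weight_imp, ])
}
```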

11. Detecting convergence

How do we know that 5 iterations are enough? We don't know upfront, but we can check afterwards. For each iteration, we can calculate how much the newly imputed variable differs from the previous imputation. Let's see how to do it. This is the same for-loop you saw on the last slide, and we will extend it slightly.

12. Detecting convergence

Before the loop, we create two empty vectors to store differences across iterations in each of the two variables.

13. Detecting convergence

At the start of each iteration, we copy the data imputed in the previous step, or just initialized in case of the first iteration, to the variable "prev_iter".

14. Detecting convergence

Finally, we append the mean absolute percentage change between the current and previous imputation of each variable, computed with the mapc function, to the corresponding vectors. You will see exactly what the mapc function is doing in the exercises.
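The mapc function itself is defined in the course exercises; the definition below is only a plausible guess at what "mean absolute percentage change" computes, and the exercise version may differ in detail.

```r
# Hypothetical definition, for illustration only:
# mean absolute percentage change between two imputations
mapc <- function(prev, curr) {
  mean(abs(curr - prev) / abs(prev))
}

mapc(c(100, 200), c(110, 180))   # 0.1: values changed by 10% on average
```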

15. Detecting convergence

If we were to plot the mapc for height and weight, we would see that in this particular case, one iteration was enough to converge, as the values don't change after that.

16. Let's practice linear regression imputation!

Let's practice linear regression imputation!