1. Practicing imputing with different models
When you develop imputation models, it is a good idea to try out a few different models, to see how the imputed values change according to your assumptions. In this lesson, we are going to impute data using linear regression.
2. Lesson overview
There are many imputation packages in R.
We are going to focus on using the simputation package by Mark van der Loo. simputation provides a simple, powerful interface to many imputation models.
We will impute values using a linear model, using impute_lm.
Building a good imputation model is super important, but it is a complex topic: there is as much to building a good imputation model as there is to building a good statistical model.
We focus on how to build up different imputation models and assess and compare them.
3. How imputing using a linear model works
We previously explored using mean imputation.
This is generally a bad imputation method to use: it leaves the mean unchanged, but it artificially reduces the variance and distorts relationships with other variables.
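As a quick check of what mean imputation does to the summary statistics, here is a minimal sketch in base R on the Ozone column of the built-in airquality data. The object names (ozone, ozone_imputed, and so on) are only for illustration.

```r
# Mean-impute Ozone from the built-in airquality data, using base R only.
ozone <- airquality$Ozone
ozone_mean <- mean(ozone, na.rm = TRUE)
ozone_imputed <- ifelse(is.na(ozone), ozone_mean, ozone)

# Mean of the observed values versus mean after imputation.
mean_before <- mean(ozone, na.rm = TRUE)
mean_after  <- mean(ozone_imputed)

# Variance of the observed values versus variance after imputation.
# Every imputed point sits exactly on the mean, so the spread shrinks.
var_before <- var(ozone, na.rm = TRUE)
var_after  <- var(ozone_imputed)

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(var_before = var_before, var_after = var_after))
```

Comparing the two printed pairs shows how the centre and the spread of the variable respond to filling every gap with the same value.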
Just as we imputed with the mean, we can impute using another function, for example a linear model.
This can take into account some features of the data, to better predict missing values.
To impute values using a linear model, we can use impute_lm from simputation.
Here we specify the variable that we would like to impute as the y, on the left hand side of the formula, and the variables we would like to use to inform the imputations on the right hand side.
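This returns a data frame with the missing values in y imputed. If the data has shadow columns from bind_shadow, the values that were originally missing stay flagged in y_NA.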
4. Using impute_lm
Using airquality data, we can impute the values in Solar.R using Wind, Temp, and Month, and chain another imputation step to impute Ozone with the same variables. This gives us imputations like the ones shown on the right.
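The chained imputation described here can be sketched as follows. This assumes the simputation and dplyr packages are installed (dplyr is loaded only for the pipe), and the name aq_imp is just for illustration.

```r
library(simputation)
library(dplyr)  # loaded here only for the %>% pipe

# Impute Solar.R, then Ozone, each from Wind, Temp and Month.
aq_imp <- airquality %>%
  impute_lm(Solar.R ~ Wind + Temp + Month) %>%
  impute_lm(Ozone ~ Wind + Temp + Month)

# Count missing values before and after imputation.
print(colSums(is.na(airquality[c("Ozone", "Solar.R")])))
print(colSums(is.na(aq_imp[c("Ozone", "Solar.R")])))
```

Wind, Temp, and Month have no missing values in airquality, so every missing Solar.R and Ozone value can be predicted.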
5. Tracking missing values
An important part of imputing data is using the bind_shadow and add_label_shadow functions.
Without them, we can't identify which values were missing!
bind_shadow adds a shadow variable ending in _NA for each column of the data, and add_label_shadow adds a column, any_missing, that is "Missing" when any value in that row was missing and "Not Missing" otherwise.
We can use ggplot to show the imputed values by setting color = any_missing in the plot aesthetics.
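Putting the tracking step and the plot together might look like the sketch below. It assumes naniar, simputation, dplyr, and ggplot2 are installed; aq_imp_lm and p are illustrative names.

```r
library(naniar)
library(simputation)
library(dplyr)
library(ggplot2)

# Record missingness first, then impute.
aq_imp_lm <- airquality %>%
  bind_shadow() %>%        # adds Ozone_NA, Solar.R_NA, ... shadow columns
  add_label_shadow() %>%   # adds any_missing: "Missing" / "Not Missing"
  impute_lm(Solar.R ~ Wind + Temp + Month) %>%
  impute_lm(Ozone ~ Wind + Temp + Month)

# Color the points by whether the row originally had a missing value.
p <- ggplot(aq_imp_lm, aes(x = Solar.R, y = Ozone, color = any_missing)) +
  geom_point()
print(p)
```

The shadow columns are created before imputing, so they still record where the gaps were once the values have been filled in.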
6. Evaluating imputations: evaluating and comparing imputations
When you build up an imputation model, it is good practice to compare it to an alternative method.
Let's compare two linear regression imputation models: one with two variables, Wind and Temperature, and the other with four, Wind, Temperature, Month, and Day.
7. Evaluating imputations: binding and visualizing many models
To compare models, we bind them together using bind_rows from the dplyr package, and give them names: small = aq_imp_small and large = aq_imp_large.
We can then set .id = "imp_model". This creates a dataset of all the imputations with an extra column, imp_model.
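A sketch of the two models and the binding step is shown below. It assumes naniar, simputation, and dplyr are installed. The names aq_imp_small and aq_imp_large come from the lesson; bound_models is an illustrative name.

```r
library(naniar)
library(simputation)
library(dplyr)

# Small model: two predictors.
aq_imp_small <- airquality %>%
  bind_shadow() %>%
  impute_lm(Ozone ~ Wind + Temp) %>%
  impute_lm(Solar.R ~ Wind + Temp) %>%
  add_label_shadow()

# Large model: four predictors.
aq_imp_large <- airquality %>%
  bind_shadow() %>%
  impute_lm(Ozone ~ Wind + Temp + Month + Day) %>%
  impute_lm(Solar.R ~ Wind + Temp + Month + Day) %>%
  add_label_shadow()

# Stack the two results; the argument names become the values of imp_model.
bound_models <- bind_rows(small = aq_imp_small,
                          large = aq_imp_large,
                          .id = "imp_model")

print(table(bound_models$imp_model))
```

Each model contributes one full copy of the data, so the bound dataset has twice as many rows as airquality.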
8. Evaluating imputations: exploring many imputations
We can then look at the values of Ozone and Solar Radiation on a scatter plot, coloring by any_missing, and faceting by the imputation model used, imp_model.
Here we see that there isn't much difference between the values imputed by the two models.
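The faceted scatter plot could be drawn roughly like this. It is a sketch assuming naniar, simputation, dplyr, and ggplot2 are installed; bound_models and p are illustrative names, and facet_wrap is one reasonable choice for the faceting.

```r
library(naniar)
library(simputation)
library(dplyr)
library(ggplot2)

aq_imp_small <- airquality %>%
  bind_shadow() %>%
  impute_lm(Ozone ~ Wind + Temp) %>%
  impute_lm(Solar.R ~ Wind + Temp) %>%
  add_label_shadow()

aq_imp_large <- airquality %>%
  bind_shadow() %>%
  impute_lm(Ozone ~ Wind + Temp + Month + Day) %>%
  impute_lm(Solar.R ~ Wind + Temp + Month + Day) %>%
  add_label_shadow()

bound_models <- bind_rows(small = aq_imp_small,
                          large = aq_imp_large,
                          .id = "imp_model")

# One panel per imputation model, points colored by missingness.
p <- ggplot(bound_models,
            aes(x = Solar.R, y = Ozone, color = any_missing)) +
  geom_point() +
  facet_wrap(~imp_model)
print(p)
```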
9. Explore imputations in multiple variables and models
To explore the imputations across these different models and variables, we select the four variables, Ozone, Solar Radiation, any_missing, and imp_model, and then gather this data, making sure that we keep any_missing and imp_model outside of the gather.
This gives us the columns any_missing, imp_model, variable, and value.
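The select-then-gather step can be sketched as follows, assuming naniar, simputation, dplyr, and tidyr are installed. The name bound_models_gather is illustrative.

```r
library(naniar)
library(simputation)
library(dplyr)
library(tidyr)

aq_imp_small <- airquality %>%
  bind_shadow() %>%
  impute_lm(Ozone ~ Wind + Temp) %>%
  impute_lm(Solar.R ~ Wind + Temp) %>%
  add_label_shadow()

aq_imp_large <- airquality %>%
  bind_shadow() %>%
  impute_lm(Ozone ~ Wind + Temp + Month + Day) %>%
  impute_lm(Solar.R ~ Wind + Temp + Month + Day) %>%
  add_label_shadow()

bound_models <- bind_rows(small = aq_imp_small,
                          large = aq_imp_large,
                          .id = "imp_model")

# Keep the four columns of interest, then stack Ozone and Solar.R
# into variable/value pairs, leaving any_missing and imp_model as they are.
bound_models_gather <- bound_models %>%
  select(Ozone, Solar.R, any_missing, imp_model) %>%
  gather(key = "variable", value = "value", -any_missing, -imp_model)

print(head(bound_models_gather))
```

Each original row now appears twice, once for Ozone and once for Solar.R.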
10. Explore imputations in multiple variables and models
We can then plot the data as a box plot, putting the imputation model on the x axis and value on the y axis, and faceting by variable.
There isn't much difference between the models.
11. Explore imputations in multiple variables and models
We can also look only at the imputed values, by filtering any_missing to "Missing" and drawing the same plot. Again, there isn't that much difference between the models.
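Both box plots, the one over all values and the one restricted to "Missing", can be sketched like this. It assumes naniar, simputation, dplyr, tidyr, and ggplot2 are installed; the object names other than aq_imp_small and aq_imp_large are illustrative.

```r
library(naniar)
library(simputation)
library(dplyr)
library(tidyr)
library(ggplot2)

aq_imp_small <- airquality %>%
  bind_shadow() %>%
  impute_lm(Ozone ~ Wind + Temp) %>%
  impute_lm(Solar.R ~ Wind + Temp) %>%
  add_label_shadow()

aq_imp_large <- airquality %>%
  bind_shadow() %>%
  impute_lm(Ozone ~ Wind + Temp + Month + Day) %>%
  impute_lm(Solar.R ~ Wind + Temp + Month + Day) %>%
  add_label_shadow()

bound_models_gather <- bind_rows(small = aq_imp_small,
                                 large = aq_imp_large,
                                 .id = "imp_model") %>%
  select(Ozone, Solar.R, any_missing, imp_model) %>%
  gather(key = "variable", value = "value", -any_missing, -imp_model)

# Box plot of all values, one panel per variable.
p_all <- ggplot(bound_models_gather, aes(x = imp_model, y = value)) +
  geom_boxplot() +
  facet_wrap(~variable)
print(p_all)

# Same plot, restricted to rows labelled "Missing".
imputed_only <- bound_models_gather %>%
  filter(any_missing == "Missing")

p_missing <- ggplot(imputed_only, aes(x = imp_model, y = value)) +
  geom_boxplot() +
  facet_wrap(~variable)
print(p_missing)
```

Note that any_missing labels whole rows, so the filtered data also includes observed values from rows where only the other variable was missing.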
12. Let's practice!
Let's practice!