Final remarks

1. Final remarks

Congratulations on making it to the end of the course! Let's recap what you have learned.

2. What you know

In Chapter 1, you saw how modeling incomplete data can be troublesome and why it requires special treatment. You also learned about the three missing data mechanisms (MCAR, MAR, and MNAR) and how to gain insights into missing data patterns by using visualizations from the VIM package and statistical tests. In Chapter 2, we covered donor-based imputation. First, you saw why mean imputation is typically a poor choice. Then, you learned to use hot-deck and kNN imputation from the VIM package, along with some tricks that make them work even better.

3. What you know

In Chapter 3, you learned the model-based imputation approach of looping over variables and imputing each of them in turn until convergence. You also saw how to increase the variability of imputed data by drawing from conditional distributions. Finally, you learned about tree-based imputation with the missForest package. In Chapter 4, you saw two methods of incorporating the uncertainty from imputation into modeling: bootstrapping using the boot package and multiple imputation by chained equations using the mice package.
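As a reminder of what "looping over variables until convergence" means in practice, here is a minimal base-R sketch. It is an illustration under stated assumptions, not the exact code from the course: it uses the built-in `airquality` data, plain linear regression as the imputation model, and an arbitrary iteration cap (`max_iter`) and tolerance (`tol`).

```r
# Iterative regression imputation: a rough sketch on the built-in
# airquality data, where Ozone and Solar.R contain missing values.
data <- airquality
miss <- is.na(data)                      # remember which cells were missing
vars <- names(which(colSums(miss) > 0))  # variables that need imputing

# Step 0: initialise missing cells with column means so that every
# regression below has complete predictors.
for (v in vars) {
  data[miss[, v], v] <- mean(data[[v]], na.rm = TRUE)
}

max_iter <- 20    # assumed cap on the number of passes
tol <- 1e-3       # assumed convergence tolerance

for (iter in seq_len(max_iter)) {
  previous <- data
  for (v in vars) {
    # Regress v on all other variables, using rows where v was observed.
    model_formula <- reformulate(setdiff(names(data), v), response = v)
    fit <- lm(model_formula, data = data[!miss[, v], ])
    # Overwrite only the originally missing cells with predictions.
    data[miss[, v], v] <- predict(fit, newdata = data[miss[, v], ])
  }
  # Stop once the imputed values barely change between passes.
  change <- max(abs(as.matrix(data[, vars]) - as.matrix(previous[, vars])))
  if (change < tol) break
}
```

Because this sketch plugs in fitted values directly, the imputed data have too little variability; drawing from the conditional distribution instead of using the prediction is what addresses that.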

4. Which imputation method to choose?

You learned about a lot of different imputation methods. Which one should you choose, and when? Here are some loose guidelines. If you have a lot of data or if your imputation has to run in real time in production, you're best off with the quick hot-deck. If you suspect specific relations between the variables based on domain knowledge, you can use this knowledge in model-based imputation. Otherwise, if the imputation need not be very fast and the relations between the variables are not obvious, a machine learning approach such as kNN or tree-based imputation is your best bet.
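To make the comparison concrete, the sketch below runs the three off-the-shelf options on the same data. It assumes the VIM and missForest packages are installed and uses the built-in `airquality` data purely for illustration; the value `k = 5` and the seed are arbitrary choices.

```r
library(VIM)         # hotdeck() and kNN()
library(missForest)  # missForest()

data <- airquality   # Ozone and Solar.R contain missing values

# Quick donor-based option: hot-deck imputation.
# imp_var = FALSE drops the extra "<variable>_imp" indicator columns.
imp_hotdeck <- hotdeck(data, imp_var = FALSE)

# k-nearest-neighbours imputation with 5 donors per missing value.
imp_knn <- kNN(data, k = 5, imp_var = FALSE)

# Tree-based imputation; the imputed data frame is stored in $ximp.
set.seed(42)
imp_rf <- missForest(data)$ximp

# Compare the mean of the imputed Ozone column across methods.
sapply(
  list(hotdeck = imp_hotdeck, kNN = imp_knn, missForest = imp_rf),
  function(d) mean(d$Ozone, na.rm = TRUE)
)
```

Timing each call with `system.time()` on your own data is a simple way to check whether the speed difference matters for your use case.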

5. How to estimate uncertainty from imputation?

You've also learned about two methods to estimate the uncertainty from imputation: bootstrapping and multiple imputation with mice. Which one should you pick? Again, let me offer some loose guidelines. If your application has to be relatively fast or if you have ideas about which models to use and how to specify them, then mice is the way to go. Otherwise, if you would like to use a non-parametric method such as kNN or hot-deck, or if you simply don't want to worry about the assumptions of specific models, then the bootstrap might be a better choice.
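The two routes can be sketched side by side. This is a rough illustration rather than the course's own code: it assumes the mice, boot, and VIM packages are installed, uses the built-in `airquality` data, and the model `Ozone ~ Temp + Wind`, the helper name `temp_coef`, the number of imputations, and the number of bootstrap replicates are all arbitrary choices.

```r
library(mice)  # mice(), with(), pool()
library(boot)  # boot(), boot.ci()
library(VIM)   # kNN()

# Route 1: multiple imputation by chained equations.
# Impute m = 5 times, fit the model on each completed data set,
# then pool the results so standard errors reflect imputation uncertainty.
imp <- mice(airquality, m = 5, method = "pmm", seed = 123, printFlag = FALSE)
fits <- with(imp, lm(Ozone ~ Temp + Wind))
pooled <- pool(fits)
summary(pooled, conf.int = TRUE)

# Route 2: bootstrap with a non-parametric imputation inside each replicate.
# temp_coef is a hypothetical helper: it imputes the resampled rows with kNN
# and returns the coefficient of Temp.
temp_coef <- function(data, indices) {
  resampled <- data[indices, ]
  imputed <- kNN(resampled, k = 5, imp_var = FALSE)
  coef(lm(Ozone ~ Temp + Wind, data = imputed))[["Temp"]]
}

set.seed(123)
boot_res <- boot(airquality, statistic = temp_coef, R = 200)
boot_ci <- boot.ci(boot_res, conf = 0.95, type = "perc")
boot_ci
```

Note that the bootstrap repeats the imputation once per replicate, which is why it tends to be the slower of the two.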

6. Next steps

One last thing before you go. If you would like to dig even deeper into the arcane imputation knowledge, I suggest you search for "miceVignettes". The authors of the mice package provide a series of six vignettes with R code and examples that touch upon issues such as passive imputation, post-processing of imputed data, imputing multi-level data, and sensitivity analysis. If you easily absorb knowledge from books, then "Flexible Imputation of Missing Data" by Stef van Buuren is a must-read. It also uses the mice package. Finally, there are some other great R packages worth exploring that we had no time to cover in this course, such as Amelia, which allows for imputing time series and panel data, and mi.

7. Congratulations and good luck!

Once again, congratulations on finishing the course and thanks for staying with me. I hope the knowledge and skills you've gained will make your work with incomplete data easier and more productive. Good luck!


Handling Missing Data with Imputations in R
