1
The Problem of Missing Data
Free
In this chapter, you’ll find out why missing data can be a risk when analyzing a dataset. You’ll be introduced to the three missing data mechanisms and learn how to recognize them using statistical tests and visualization tools.
2
Donor-Based Imputation
Get to know the taxonomy of imputation methods and learn three donor-based techniques: mean, hot-deck, and k-Nearest-Neighbors imputation. You’ll look under the hood to see how these methods work, before learning how to apply them to a real-world tropical weather dataset. Along the way, you’ll also learn useful tricks that you can use to make them work even better for your problems.
3
Model-Based Imputation
It’s time to learn how to use statistical and machine learning models, such as linear regression, logistic regression, and random forests, to impute missing data. In this chapter, you’ll look into how the models make their predictions and use this knowledge to draw the imputed values from conditional distributions. This is important as it ensures your imputations are more varied and plausible, making them more similar to the true data.
4
Uncertainty from Imputation
Imputed values are not set in stone. They are just estimates and estimates come with some uncertainty. In this final chapter, you’ll discover how bootstrapping and chained equation using the mice package can be used to incorporate imputation uncertainty into your models and analyses to make them more reliable and robust.

Initializing

Using predictor matrix

An important decision that needs to be taken when using model-based imputation is which variables should be included as predictors, and in which models. In mice(), this is governed by the predictor matrix and by default, all variables are used to impute all others.

In case of many variables in the data or little time to do a proper model selection, you can use mice's functionality to create a predictor matrix based on the correlations between the variables. This matrix can then be passed to mice(). In this exercise, you will practice exactly this: you will first build a predictor matrix such that each variable will be imputed using variables most correlated to it; then, you will feed your predictor matrix to the imputing function. Let's try this simple model selection!

Create a predictor matrix that picks predictors with a correlation with the target of at least 0.1, assigning the result to pred_mat.