Multiple imputation by chained equations
1. Multiple imputation by chained equations
Welcome back! In the previous lesson, you learned how to estimate the uncertainty from imputation using bootstrapping. This lesson covers another approach, called multiple imputation by chained equations.

2. The MICE algorithm
This algorithm, abbreviated "MICE", works somewhat like the bootstrap: a model is fit to multiple imputed data sets and the results are combined. Unlike the bootstrap, though, MICE doesn't create many data sets before imputing. Instead, the same original data are imputed multiple times, drawing from the conditional distributions you learned about when we discussed increasing variability in imputed data. Since the imputed values are drawn randomly, each imputation replaces the same missing value with a different draw, which yields many differently imputed data sets. Then a model is fit to every imputed data set, and the results are pooled to obtain the mean and variance of the quantities of interest, such as regression coefficients. The diagram illustrates this process and shows the three functions from the "mice" package that implement it: "mice" imputes multiple times, "with" fits a model to each imputed data set, and "pool" aggregates the results. You will see this in the code shortly.

3. MICE: pros & cons
MICE's biggest advantage over the bootstrap is that it needs far fewer replications to produce reliable results: in most cases, a few dozen imputations are enough. Just like the bootstrap, it works with MAR and MCAR data. A limitation of MICE is that it only works with model-based imputation methods that allow constructing conditional distributions to draw from; this rules out methods such as hot-deck or kNN imputation. MICE also requires some tuning effort: one has to pick the model for each variable, as well as the predictors to include.

4. The mice flow: mice - with - pool
Let's see how to implement it in practice on the nhanes data. First, we load the mice package and call its mice() function on the nhanes data. We set "m", the number of imputations, to 20 and call the result, a multiply imputed data set, nhanes_multiimp. Next, we use the with() function to fit a linear model explaining Weight to each imputed data set. The result, which we call lm_multiimp, is a collection of regression models. To pool their results together, we call the pool() function.

5. Analyzing pooled results
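As a hedged sketch of this mice-with-pool flow: the lesson's exact NHANES extract (with Weight and other columns) isn't reproduced here, so the example below uses the small nhanes data set that ships with the mice package (columns age, bmi, hyp, chl); the seed and model formula are illustrative.

```r
# Sketch of the mice -> with -> pool flow, using the small nhanes data
# bundled with the mice package as a stand-in for the lesson's data.
library(mice)

# Impute the same data 20 times; each imputation draws different
# random values for the missing entries.
nhanes_multiimp <- mice(nhanes, m = 20, seed = 2021, printFlag = FALSE)

# Fit the same linear model to each of the 20 completed data sets.
lm_multiimp <- with(nhanes_multiimp, lm(bmi ~ age + chl))

# Pool the 20 sets of estimates into one set of results.
lm_pooled <- pool(lm_multiimp)

# Summarize with 95% confidence intervals that reflect
# imputation uncertainty.
summary(lm_pooled, conf.int = TRUE, conf.level = 0.95)
```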
To analyze the pooled results, we can feed them to the summary() function, setting conf.int to TRUE and conf.level to 0.95 to obtain confidence intervals at the 95% confidence level. We can see the pooled estimates of the regression coefficients, their standard errors, and, in the last two columns, the lower and upper bounds of the confidence intervals that account for imputation uncertainty.

6. MICE: available methods
In the MICE algorithm, each variable has its own model that imputes it. The mice package offers a range of possible model choices, depending on the type of the variable in question. Let's see how to choose which method to use.

7. Choosing methods per variable type
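As a hedged sketch of setting one default method per variable type, the example below uses nhanes2, a version of the bundled nhanes data in which age and hyp are stored as factors; the lesson's own data and method choices may differ.

```r
library(mice)

# One default method per variable type: continuous, binary,
# unordered categorical, and ordered factors, respectively.
imp <- mice(nhanes2, m = 20, seed = 2021, printFlag = FALSE,
            defaultMethod = c("pmm", "logreg", "polyreg", "polr"))

# Inspect which method was assigned to each variable.
imp$method
```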
While it is possible to set a separate imputation method for every single variable, a popular choice is to set a default method for each variable type. To do this, we can use the defaultMethod argument of the mice() function, which should be a vector of four strings specifying the methods for the four variable types: respectively, continuous variables, binary variables, unordered categorical variables, and ordered factors. Here, we choose predictive mean matching to impute continuous variables, logistic regression for binary ones, multinomial logistic regression for unordered categorical variables, and the ordered logit model for ordered factors. These models are simple and quick to fit.

8. Predictor matrix
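As a sketch of inspecting the default predictor matrix (again on the bundled nhanes data; in the lesson, the same $predictorMatrix notation is applied to the multiply imputed NHANES object):

```r
library(mice)

# Impute with all defaults: no predictorMatrix argument supplied.
imp <- mice(nhanes, m = 5, seed = 2021, printFlag = FALSE)

# The matrix of 0s and 1s: row = variable being imputed,
# column = candidate predictor. By default, every variable is
# predicted by all others (1s) but never by itself (0 diagonal).
imp$predictorMatrix
```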
The final topic we'll discuss is how to choose which variables should be included as predictors, and in which models. This is governed by the predictor matrix, whose rows and columns denote the variables; a value of 1 indicates that the variable in the corresponding column was used to impute the one in the corresponding row, and a 0 that it was not. The predictor matrix that was used can be extracted from a multiply imputed data set using the $predictorMatrix notation. Here, we don't specify it when calling mice. As a result, we use the default, in which every variable is predicted by all other variables, but not by itself. But is there a better choice?

9. Choosing predictors for each variable
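As a hedged sketch of correlation-based predictor selection (bundled nhanes data again; the mincor value of 0.3 is illustrative, not a recommendation):

```r
library(mice)

# Keep a predictor only if its correlation with the target
# variable exceeds mincor.
pred <- quickpred(nhanes, mincor = 0.3)

# Pass the custom predictor matrix to mice.
imp <- mice(nhanes, m = 20, predictorMatrix = pred,
            seed = 2021, printFlag = FALSE)

# Inspect which predictors were selected for each variable.
pred
```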
Ideally, one should perform proper model selection to pick the best predictors. However, a quick alternative can already improve performance: for each variable, we can pick the predictors most correlated with it. To do this, we create a correlation-based predictor matrix with the quickpred() function. The mincor argument specifies the correlation-coefficient threshold above which a predictor is included. Then we run the mice() function as usual, passing our matrix as the predictorMatrix argument. We can also take a look at the predictor matrix we created: from this truncated output, you can see that the model imputing Weight uses Age, Gender and Pulse as predictors.

10. Let's practice imputing with MICE!
Let's practice imputing with MICE!