1. Multiple imputation by bootstrapping
Welcome to the final chapter of the course, in which we will discuss multiple imputation - a technique that lets us account for the uncertainty coming from imputation. Let's see why it's so important!
2. Uncertainty from imputation
We typically don't impute data for its own sake. We do it so that we can perform some analysis or modeling. However, imputed values come with some uncertainty - we cannot be sure how correct our imputation is. This uncertainty should be accounted for in any analyses that we carry out on imputed data.
To show how important this is, Ranjit Lall from Harvard University replicated multiple published studies that had imputed data before analysis. He found that in almost half of these studies, key results “disappear” when the uncertainty from imputation is accounted for.
3. Bootstrap
One of the methods to account for uncertainty from imputation is bootstrapping.
To create a bootstrap sample of a data frame, we draw randomly from the rows of the original data. We take as many rows as there are in the original data, so that the bootstrap sample is of the same size, but the draws are taken with replacement. This means that some rows will appear more than once in the bootstrap sample, and some will not be there at all. Let's see how to use this concept for imputation.
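As a quick illustration, here is one way to draw a bootstrap sample in R, assuming the original data sits in a data frame called df (a hypothetical name):

    indices <- sample(nrow(df), replace = TRUE)  # row numbers drawn with replacement
    boot_sample <- df[indices, ]                 # same size as df; some rows repeat, some are absent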
4. Multiple imputation by bootstrapping
Multiple imputation by bootstrapping comprises a number of steps. First, we take many bootstrap samples from the original data. The picture shows three samples, although in practice you would typically want at least a few hundred. Then, we impute each sample with our method of choice. Next, we perform some analysis or modeling on each of the many imputed data sets. This could be as simple as computing the mean of a variable or as sophisticated as fitting a complex neural network. Finally, when we obtain a single result from each bootstrap sample, we put these so-called bootstrap replicates together to form a distribution of results. We can then use the mean of this distribution as a point estimate or look at the quantiles to obtain confidence intervals.
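The whole procedure can be sketched in a few lines of R. This is only a schematic outline: impute() and stat() are hypothetical placeholders for whatever imputation method and analysis you choose, and data stands for your original data frame.

    n_boot <- 500                          # number of bootstrap samples
    replicates <- numeric(n_boot)
    for (b in seq_len(n_boot)) {
      indices <- sample(nrow(data), replace = TRUE)  # 1. draw a bootstrap sample
      data_imp <- impute(data[indices, ])            # 2. impute it (placeholder function)
      replicates[b] <- stat(data_imp)                # 3. analyze it (placeholder function)
    }
    mean(replicates)                       # 4. point estimate from the distribution
    quantile(replicates, c(0.025, 0.975))  #    and a 95% confidence interval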
5. Bootstrapped imputation: pros & cons
Multiple imputation by bootstrapping is a convenient way of assessing the uncertainty from imputation. It works with any imputation method and can approximate quantities that are hard to calculate analytically. For instance, it can approximate regression standard errors on imputed data, for which there is no closed-form formula. Multiple imputation by bootstrapping works with both MAR (missing at random) and MCAR (missing completely at random) data.
The disadvantage of the bootstrap is the time it takes to run, especially if you need many replicates or if the computation done on each sample is time-consuming.
6. Bootstrapping in practice
To implement multiple imputation by bootstrapping in R, we first need to define a function that calculates our statistic of interest. Let's say we want to know the correlation between weight and total cholesterol in the nhanes data. These two variables both have some missing values. Let's start with a skeleton of the function. We will call it calc_correlation. It has to take two arguments: data, the full data frame, and indices, indicating the rows to be selected for a single bootstrap sample. Finally, the function has to return the correlation coefficient we want to calculate.
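A sketch of this skeleton could look as follows; we will fill in the body over the next few slides:

    calc_correlation <- function(data, indices) {
      # 1. create a bootstrap sample from data using indices
      # 2. impute the sample
      # 3. calculate and return the correlation coefficient
    }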
7. Bootstrapping in practice
First, we create a bootstrap sample by slicing data with indices. Let's call it data_boot.
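In code, this first step is a single line of row-slicing:

    calc_correlation <- function(data, indices) {
      data_boot <- data[indices, ]  # select the rows for this bootstrap sample
      # imputation and correlation still to come
    }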
8. Bootstrapping in practice
Next, we impute the bootstrap sample, for instance with the kNN imputation. Let's call this result data_imp.
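Here is the function with the imputation step added, assuming the kNN() function from the VIM package:

    library(VIM)  # provides kNN()

    calc_correlation <- function(data, indices) {
      data_boot <- data[indices, ]  # bootstrap sample
      data_imp <- kNN(data_boot)    # impute it with k-nearest neighbors
      # correlation still to come
    }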
9. Bootstrapping in practice
Finally, we calculate the statistic of interest, which is the correlation coefficient between weight and total cholesterol, based on the imputed data. Our calc_correlation function is now ready to be fed into the bootstrapping algorithm.
Internally, the bootstrapping function will repeatedly generate new sets of indices from the data and call calc_correlation with each of them.
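Putting the three steps together, the complete function could look like this; the column names Weight and TotChol are assumptions about the nhanes data, so adjust them to match your own variables:

    library(VIM)  # provides kNN()

    calc_correlation <- function(data, indices) {
      data_boot <- data[indices, ]            # 1. slice out one bootstrap sample
      data_imp <- kNN(data_boot)              # 2. impute its missing values
      cor(data_imp$Weight, data_imp$TotChol)  # 3. return the statistic of interest
    }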
10. Bootstrapping in practice
We load the boot library and call the boot function on the nhanes data. We pass our calc_correlation as statistic. We also set R, the number of bootstrap replicates, to 50.
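A minimal sketch of this call, with boot_results as an assumed name for the output object:

    library(boot)

    boot_results <- boot(nhanes, statistic = calc_correlation, R = 50)
    boot_results  # print the bootstrap summary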
Once the code has run, we can print the results. The original value is the correlation we would have obtained from the original data. Bias is the difference between the mean of the bootstrap estimates of the correlation and the original value. Standard error is the standard deviation of the bootstrap replicates. Note that it is larger than the mean estimate itself, which indicates large uncertainty!
11. Plotting bootstrap results
We can also call the plot function on the bootstrapping results to see the distribution of the correlation coefficient. It looks pretty normal, so we can calculate a confidence interval based on the normal distribution.
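Continuing with the assumed boot_results object from the previous slide:

    plot(boot_results)  # histogram and Q-Q plot of the bootstrap replicates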
12. Bootstrapping confidence intervals
To do so, we call boot.ci on the bootstrapping results. We specify the confidence level to be 0.95 and the type to "norm", which uses the normal distribution. We are 95% confident that the correlation between weight and cholesterol is between -0.0569 and 0.1054. This means that, given the uncertainty from imputing these variables, we cannot even be sure that the correlation is positive!
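Again using the assumed boot_results object:

    boot.ci(boot_results, conf = 0.95, type = "norm")  # normal-approximation 95% CI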
13. Let's practice bootstrapping!
Let's practice bootstrapping!