Replicating data variability

1. Replicating data variability

Welcome! In this lesson, we will continue investigating the model-based approach. We will look at how to implement it for binary variables, while discussing an important topic: variability in imputed data.

2. Variability in imputed data

You might remember this margin plot from Chapter 1. It shows the results of mean imputation of Height and Weight from the nhanes data. Back then, we said that this method provides no variability in imputed data. This is a problem, because we would like the imputation to replicate the variability of the observed data. In the model-based approach, if multiple observations have the same values for all predictors, the imputing model will give them all the same prediction. There is a way to fix this, though: drawing from conditional distributions. Let's see what this means in the next slide.

3. What is a prediction

What, actually, is a prediction? Most statistical models estimate the conditional distribution of the response variable. We write it as the probability of y given X, and we read it as: what is the probability of the response y taking a certain value, given some values of the predictors X. To make a single prediction, this conditional distribution has to be summarized into a single number. In linear regression, we take the expected value, or the mean, of the distribution. In logistic regression, be it binary or multinomial, we pick the class with the highest probability. Another approach is to draw from this distribution instead, in order to increase the variability in imputed data. Let's discuss how to do this in the next slide.

4. Drawing from conditional distributions

In the case of linear regression, we assume that the errors are normally distributed, and the model provides an estimate of the error variance. Let's say the prediction for a specific observation is 25 and the error variance is 5. To increase variability, instead of imputing the predicted value of 25, we impute with a value drawn from a normal distribution with mean 25 and variance 5. This means we could impute 25, or 23, or 27, or any other value, depending on the random draw.
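
In R, this idea might look as follows (a minimal sketch; the numbers mirror the example above, and note that rnorm expects a standard deviation, so we pass the square root of the variance):

    pred <- 25           # predicted value from the linear regression
    error_variance <- 5  # estimated error variance from the model

    # Impute with a draw from the conditional distribution N(25, 5)
    imputed_value <- rnorm(1, mean = pred, sd = sqrt(error_variance))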

5. Drawing from conditional distributions

In the case of binary logistic regression, when the response can take only two values, say 0 or 1, the model predicts the probability of class 1. Assume it is 70% for a couple of observations. It is common practice to check whether the predicted probability exceeds some threshold, often 0.5, and predict 1 if it does and 0 otherwise. Instead, we could draw from a binomial distribution with a single trial, where class 1 has probability 70% and class 0 has probability 30%. This way, out of 100 cases with a 70% probability, we would on average predict 1 for 70 of them and 0 for the remaining 30. This increases the variability in imputed data! Let's see how to perform logistic regression imputation in practice.
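
As a quick illustration, here is a sketch of such a draw for 100 hypothetical cases, each with a 70% probability of class 1:

    # One Bernoulli trial per case: 1 with probability 0.7, 0 otherwise
    draws <- rbinom(n = 100, size = 1, prob = 0.7)
    table(draws)  # roughly 30 zeros and 70 ones, varying from run to run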

6. Logistic regression imputation

Let's impute the PhysActive variable from the nhanes data with logistic regression. It is a binary variable that is 1 when a person is physically active and 0 otherwise. First, we initialize the missing values with hotdeck imputation. Next, we check the locations where PhysActive was originally missing and save them to missing_physactive.
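
A sketch of this step, assuming the hotdeck function from the VIM package and a data frame called nhanes (the name nhanes_imp is illustrative; hotdeck adds logical *_imp columns flagging which entries it imputed):

    library(VIM)

    # Initialize all missing values with hot-deck imputation
    nhanes_imp <- hotdeck(nhanes)

    # Locations where PhysActive was originally missing
    missing_physactive <- which(nhanes_imp$PhysActive_imp)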

7. Logistic regression imputation

Then, we use R's built-in glm function to fit the logistic regression model. We will explain PhysActive with Age, Weight and Pulse from the nhanes data. Setting family to binomial makes it a logistic regression model.
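
This step might look like the following (a sketch continuing the names assumed above):

    # Logistic regression of PhysActive on Age, Weight, and Pulse;
    # family = binomial makes glm fit a logistic regression
    logreg_model <- glm(PhysActive ~ Age + Weight + Pulse,
                        data = nhanes_imp, family = binomial)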

8. Logistic regression imputation

Next, we make predictions on the training data. We just need to set type to response, to make sure the predictions are expressed as probabilities. We save them as preds.
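
For example:

    # type = "response" returns predicted probabilities of class 1
    preds <- predict(logreg_model, type = "response")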

9. Logistic regression imputation

The predictions are values between 0 and 1, and we need either 0 or 1. Let's take a threshold of 0.5 and convert all predictions above it to 1, and those at or below it to 0.
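
A one-line sketch of the thresholding:

    # Convert probabilities to 0/1 classes with a 0.5 threshold
    preds <- ifelse(preds > 0.5, 1, 0)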

10. Logistic regression imputation

Finally, we slice the data to extract the PhysActive variable at locations where it was missing, and assign the predicted values to those locations.
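
For instance:

    # Overwrite the hot-deck initialization at the originally missing spots
    nhanes_imp$PhysActive[missing_physactive] <- preds[missing_physactive]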

11. Logistic regression imputation

Let's check how variable the imputed values are. Not at all! In all 26 cases, the value 1 was imputed. This doesn't look like what we see in the observed data.
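
One way to check this is to tabulate the imputed values (a sketch using the names assumed above):

    # Counts of 0s and 1s among the imputed values
    table(nhanes_imp$PhysActive[missing_physactive])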

12. Drawing from class probabilities

Let's now adjust the code you've just seen to apply drawing from the conditional distribution. This will only require changing one line.

13. Drawing from class probabilities

We no longer set the predictions to 0 or 1 according to a threshold. Instead, we use the rbinom function to draw from the binomial distribution.

14. Drawing from class probabilities

We set the number of observations, n, to the length of preds; size, which denotes the number of trials, to 1; and the class probabilities, prob, to preds. This way, we obtain preds that are 0 or 1, but sampled according to the probabilities estimated by the model. To sum up, we have simply replaced the 0.5 threshold with drawing from the distribution of the predictions.
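
Putting it together, the adjusted line might read:

    # Draw each imputed class from its own Bernoulli distribution:
    # one trial per observation, with success probability preds
    preds <- rbinom(n = length(preds), size = 1, prob = preds)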

15. Drawing from class probabilities

Let's look at the variability in imputed data. This time it's larger, with class 0 predicted in 5 out of 26 cases. This ratio is quite similar to the one in the original, observed data: there are roughly four times as many physically active people as inactive ones.

16. Let's practice replicating data variability!

Let's practice replicating data variability!