Replicating data variability

1. Replicating data variability

Welcome! In this lesson, we will continue investigating the model-based approach. We will look at how to implement it for binary variables, while discussing an important topic: variability in imputed data.

2. Variability in imputed data

You might remember this margin plot from Chapter 1. It shows the results of mean imputation of Height and Weight from the nhanes data. Back then, we said that this method provides no variability in imputed data. This is a problem, because we would like the imputation to replicate the variability of the observed data. In the model-based approach, if multiple observations have the same values for all predictors, the imputing model will give them all the same prediction. There is a way to fix this, though: drawing from conditional distributions. Let's see what this means in the next slide.

3. What is a prediction

What, actually, is a prediction? Most statistical models estimate the conditional distribution of the response variable. We write it as the probability of y given X, and we read it as: what is the probability of the response y taking a certain value, given some values of the predictors X. To make a single prediction, this conditional distribution has to be summarized into a single number. In linear regression, we take the expected value, or the mean, of the distribution. In logistic regression, be it binary or multinomial, we pick the class with the highest probability. Another approach is to draw from this distribution instead, in order to increase the variability in imputed data. Let's discuss how to do this in the next slide.

4. Drawing from conditional distributions

In the case of linear regression, we assume that the errors are normally distributed, and the model provides an estimate of the error variance. Let's say the prediction for a specific observation is 25 and the error variance is 5. To increase variability, instead of imputing the predicted value of 25, we impute with a value drawn from a normal distribution with mean 25 and variance 5. This means we could impute 25, or 23, or 27, or any other value, depending on the random draw.
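
In R, this idea might look as follows (a minimal sketch; the numbers mirror the example above, and note that rnorm expects a standard deviation, so we pass the square root of the variance):

    pred <- 25           # predicted value from the linear regression
    error_variance <- 5  # estimated error variance from the model

    # Impute with a draw from the conditional distribution N(25, 5)
    imputed_value <- rnorm(1, mean = pred, sd = sqrt(error_variance))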

5. Drawing from conditional distributions

In the case of binary logistic regression, when the response can take only two values, say 0 or 1, the model predicts the probability of class 1. Assume it is 70% for a couple of observations. It is common practice to check whether the predicted probability exceeds some threshold, often 0.5, and predict 1 if it does and 0 otherwise. Instead, we could draw from a binomial distribution with a single trial, where class 1 has probability 70% and class 0 has probability 30%. This way, out of 100 cases with a 70% probability, we would on average predict 1 for 70 of them and 0 for the remaining 30. This increases the variability in imputed data! Let's see how to perform logistic regression imputation in practice.
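
As a quick illustration, here is a sketch of such a draw for 100 hypothetical cases, each with a 70% probability of class 1:

    # One Bernoulli trial per case: 1 with probability 0.7, 0 otherwise
    draws <- rbinom(n = 100, size = 1, prob = 0.7)
    table(draws)  # roughly 30 zeros and 70 ones, varying from run to run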

6. Logistic regression imputation

Let's impute the PhysActive variable from the nhanes data with logistic regression. It is a binary variable that is 1 when a person is physically active and 0 otherwise. First, we initialize the missing values with hotdeck imputation. Next, we check the locations where PhysActive was originally missing and save them to missing_physactive.
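
A sketch of this step, assuming the hotdeck function from the VIM package and a data frame called nhanes (the name nhanes_imp is illustrative; hotdeck adds logical *_imp columns flagging which entries it imputed):

    library(VIM)

    # Initialize all missing values with hot-deck imputation
    nhanes_imp <- hotdeck(nhanes)

    # Locations where PhysActive was originally missing
    missing_physactive <- which(nhanes_imp$PhysActive_imp)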

7. Logistic regression imputation

Then, we use R's built-in glm function to fit the logistic regression model. We will explain PhysActive with Age, Weight and Pulse from the nhanes data. Setting family to binomial makes it a logistic regression model.
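
This step might look like the following (a sketch continuing the names assumed above):

    # Logistic regression of PhysActive on Age, Weight, and Pulse;
    # family = binomial makes glm fit a logistic regression
    logreg_model <- glm(PhysActive ~ Age + Weight + Pulse,
                        data = nhanes_imp, family = binomial)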

8. Logistic regression imputation

Next, we make predictions on the training data. We just need to set type to response, to make sure the predictions are expressed as probabilities. We save them as preds.
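
For example:

    # type = "response" returns predicted probabilities of class 1
    preds <- predict(logreg_model, type = "response")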

9. Logistic regression imputation

The predictions are values between 0 and 1, and we need either 0 or 1. Let's take a threshold of 0.5 and convert all predictions above it to 1, and those at or below it to 0.
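
A one-line sketch of the thresholding:

    # Convert probabilities to 0/1 classes with a 0.5 threshold
    preds <- ifelse(preds > 0.5, 1, 0)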

10. Logistic regression imputation

Finally, we slice the data to extract the PhysActive variable at locations where it was missing, and assign the predicted values to those locations.
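
For instance:

    # Overwrite the hot-deck initialization at the originally missing spots
    nhanes_imp$PhysActive[missing_physactive] <- preds[missing_physactive]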

11. Logistic regression imputation

Let's check how variable the imputed values are. Not at all! In all 26 cases, the value 1 was imputed. This doesn't look like what we see in the observed data.
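
One way to check this is to tabulate the imputed values (a sketch using the names assumed above):

    # Counts of 0s and 1s among the imputed values
    table(nhanes_imp$PhysActive[missing_physactive])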

12. Drawing from class probabilities

Let's now adjust the code you've just seen to apply drawing from the conditional distribution. This will only require changing one line.

13. Drawing from class probabilities

We no longer set the predictions to 0 or 1 according to a threshold. Instead, we use the rbinom function to draw from the binomial distribution.

14. Drawing from class probabilities

We set the number of observations, n, to the length of preds; size, which denotes the number of trials, to 1; and the class probabilities, prob, to preds. This way, we obtain preds that are 0 or 1, but sampled according to the probabilities estimated by the model. To sum up, we have simply replaced the 0.5 threshold with drawing from the distribution of the predictions.
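
Putting it together, the adjusted line might read:

    # Draw each imputed class from its own Bernoulli distribution:
    # one trial per observation, with success probability preds
    preds <- rbinom(n = length(preds), size = 1, prob = preds)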

15. Drawing from class probabilities

Let's look at the variability in imputed data. This time it's larger, with class 0 predicted in 5 out of 26 cases. This ratio is quite similar to the one in the original, observed data: there are roughly four times as many physically active people as inactive ones.

16. Let's practice replicating data variability!

Let's practice replicating data variability!