t-test for MAR: data preparation

Great work on classifying the missing data mechanisms in the last exercise! Of the three mechanisms, MAR (missing at random) is arguably the most important one to detect, as many imputation methods assume the data are MAR. This exercise will, therefore, focus on testing for MAR.

You will be working with the familiar biopics data. The goal is to test whether the share of missing values in earnings differs by the subject's gender. In this exercise, you will only prepare the data for the t-test. First, you will create a dummy variable indicating missingness in earnings. Then, you will split it by gender: for each gender, filter the data to keep only that gender, and then pull the dummy variable as a vector. For filtering, it might be helpful to print the head() of biopics in the console and examine the gender variable.
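The preparation steps described above can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: the data frame `toy`, its column names `gender` and `earnings`, and the labels "Male" and "Female" are hypothetical stand-ins, not the actual contents of biopics.

```r
library(dplyr)

# Hypothetical stand-in for biopics; names and values are assumptions
toy <- data.frame(
  gender   = c("Male", "Female", "Male", "Female", "Male"),
  earnings = c(12.5, NA, NA, 3.1, 40.0)
)

# Step 1: dummy variable that is TRUE where earnings is missing
toy <- toy %>%
  mutate(missing_earnings = is.na(earnings))

# Step 2: keep one gender, then pull the dummy out as a plain vector
missing_males <- toy %>%
  filter(gender == "Male") %>%
  pull(missing_earnings)

missing_females <- toy %>%
  filter(gender == "Female") %>%
  pull(missing_earnings)
```

Each result is a logical vector with one element per row of the corresponding gender, which is a shape that a two-sample t-test can compare.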

This exercise is part of the course

Handling Missing Data with Imputations in R

Exercise instructions

Add another variable to biopics called missing_earnings that is TRUE if earnings is missing and FALSE otherwise.
Create a vector of missing_earnings values for males and assign it to missing_earnings_males.
Create a vector of missing_earnings values for females and assign it to missing_earnings_females.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create a dummy variable for missing earnings
biopics <- biopics %>% 
  ___(missing_earnings = ___(___))

# Pull the missing earnings dummy for males
missing_earnings_males <- biopics %>% 
  ___(___) %>% 
  ___(___)

# Pull the missing earnings dummy for females
missing_earnings_females <- biopics %>% 
  ___(___) %>% 
  ___(___)
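To see why the data are prepared as two vectors, here is a rough sketch of how such vectors could later be passed to a t-test. The values below are invented for illustration and are not taken from biopics; converting the dummies with `as.numeric()` is a choice made here for clarity.

```r
# Hypothetical missingness dummies (TRUE = earnings missing); invented values
missing_earnings_males   <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
missing_earnings_females <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)

# The mean of a logical vector is the share of TRUE values,
# i.e. the share of missing earnings in each group
share_males   <- mean(missing_earnings_males)
share_females <- mean(missing_earnings_females)

# Two-sample t-test (Welch by default) comparing the two group means
test_result <- t.test(
  as.numeric(missing_earnings_males),
  as.numeric(missing_earnings_females)
)

test_result$p.value
```

A small p-value would suggest that the share of missing earnings differs between the groups, which is consistent with earnings being MAR with respect to gender.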

In this chapter, you’ll find out why missing data can be a risk when analyzing a dataset. You’ll be introduced to the three missing data mechanisms and learn how to recognize them using statistical tests and visualization tools.

Exercise 1: Missing data: what can go wrong
Exercise 2: Linear regression with incomplete data
Exercise 3: Analyzing regression output
Exercise 4: Comparing models
Exercise 5: Missing data mechanisms
Exercise 6: Recognizing missing data mechanisms
Exercise 7: t-test for MAR: data preparation (current exercise)
Exercise 8: t-test for MAR: interpretation
Exercise 9: Visualizing missing data patterns
Exercise 10: Aggregation plot
Exercise 11: Spine plot
Exercise 12: Mosaic plot

Get to know the taxonomy of imputation methods and learn three donor-based techniques: mean, hot-deck, and k-Nearest-Neighbors imputation. You’ll look under the hood to see how these methods work, before learning how to apply them to a real-world tropical weather dataset. Along the way, you’ll also learn useful tricks that you can use to make them work even better for your problems.

Exercise 1: Mean imputation
Exercise 2: Smelling the danger of mean imputation
Exercise 3: Mean-imputing the temperature
Exercise 4: Assessing imputation quality with margin plot
Exercise 5: Hot-deck imputation
Exercise 6: Vanilla hot-deck
Exercise 7: Hot-deck tricks & tips I: imputing within domains
Exercise 8: Hot-deck tricks & tips II: sorting by correlated variables
Exercise 9: k-Nearest-Neighbors imputation
Exercise 10: Choosing the number of neighbors
Exercise 11: kNN tricks & tips I: weighting donors
Exercise 12: kNN tricks & tips II: sorting variables

It’s time to learn how to use statistical and machine learning models, such as linear regression, logistic regression, and random forests, to impute missing data. In this chapter, you’ll look into how the models make their predictions and use this knowledge to draw the imputed values from conditional distributions. This is important as it ensures your imputations are more varied and plausible, making them more similar to the true data.

Exercise 1: Model-based imputation approach
Exercise 2: Linear regression imputation
Exercise 3: Initializing missing values & iterating over variables
Exercise 4: Detecting convergence
Exercise 5: Replicating data variability
Exercise 6: Logistic regression imputation
Exercise 7: Drawing from conditional distribution
Exercise 8: Model-based imputation with multiple variable types
Exercise 9: Tree-based imputation
Exercise 10: Imputing with random forests
Exercise 11: Variable-wise imputation errors
Exercise 12: Speed-accuracy trade-off

Imputed values are not set in stone. They are just estimates, and estimates come with some uncertainty. In this final chapter, you’ll discover how bootstrapping and chained equations, using the mice package, can be used to incorporate imputation uncertainty into your models and analyses to make them more reliable and robust.

Exercise 1: Multiple imputation by bootstrapping
Exercise 2: Wrapping imputation & modeling in a function
Exercise 3: Running the bootstrap
Exercise 4: Bootstrapping confidence intervals
Exercise 5: Multiple imputation by chained equations
Exercise 6: The mice flow: mice - with - pool
Exercise 7: Choosing default models
Exercise 8: Using predictor matrix
Exercise 9: Putting it all together
Exercise 10: Analyzing missing data patterns
Exercise 11: Imputing and inspecting outcomes
Exercise 12: Inference with imputed data
Exercise 13: Final remarks