Putting it all together

1. Putting it all together

Welcome back and congratulations! You have reached the final lesson of the course, in which we will put together all that you've learned.

2. Case study: civil liberties in Africa

The final few exercises will take the form of a case study, in which you will model an incomplete data set. The data set, called "africa", comes from the Africa Research Program. It contains data on a few economic and political variables in six African states between 1972 and 1991. The variables are year, country name, Gross Domestic Product (or GDP) per capita, inflation, trade as a percentage of GDP, a measure of civil liberties, and total population.

3. Modeling incomplete data

Your final goal is to investigate the relationship between civil liberties and GDP per capita. To do this, you will start by visualizing the incomplete data. This will tell you which variables have missing values and what the missing data mechanisms might be. Then, you will impute the missing data and inspect the quality of your imputation. Finally, you will run a regression model on the imputed data, while incorporating the uncertainty from imputation into the results.
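The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the exercise solution: the data frame africa_sim is a simulated, hypothetical stand-in for the africa data, the column names gdp_pc, civlib and trade are assumed, and the regression formula is just one possible choice.

```r
library(mice)

set.seed(1)
n <- 120

# Hypothetical stand-in for the africa data (column names are assumptions)
africa_sim <- data.frame(
  civlib = runif(n),
  trade  = rnorm(n, mean = 60, sd = 15)
)
africa_sim$gdp_pc <- 500 + 800 * africa_sim$civlib +
  5 * africa_sim$trade + rnorm(n, sd = 100)

# Introduce missing values completely at random
africa_sim$gdp_pc[sample(n, 20)] <- NA
africa_sim$trade[sample(n, 15)] <- NA

# Step 1: find out which variables have missing values
missing_counts <- colSums(is.na(africa_sim))
print(missing_counts)

# Step 2: create 5 imputed data sets
africa_multiimp <- mice(africa_sim, m = 5, seed = 3108, printFlag = FALSE)

# Step 3: inspect one completed data set (no missing values should remain)
first_completed <- complete(africa_multiimp, 1)
print(summary(first_completed$gdp_pc))

# Step 4: fit the regression on every imputed data set, then pool the
# results so that the standard errors reflect imputation uncertainty
lm_multiimp <- with(africa_multiimp, lm(gdp_pc ~ civlib))
lm_pooled <- pool(lm_multiimp)
pooled_summary <- summary(lm_pooled, conf.int = TRUE)
print(pooled_summary)
```

The pooled summary contains one row per model term. Its confidence intervals account for both the variation within each imputed data set and the variation between them.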

4. What you will need

Here's a list of functions you will be using throughout this lesson. You've seen all of them in the previous chapters.

5. Assessing imputation quality with MICE

Before you set off, let's discuss one thing that we didn't look at in the previous lesson: assessing imputation quality with MICE. The result of MICE imputation is not one, but many imputed data sets. For this reason, using the functions from the VIM package to visualize each imputed data set separately would be cumbersome. Luckily, the mice package offers its own plots that automatically handle all data sets. Let's see how to use them.

We first impute the nhanes data with mice. We use 5 imputations and set the default method to pmm, which stands for predictive mean matching. By passing only one string, we set the same default method for all variable types. The result, nhanes_multiimp, is a mice object with multiple imputed data sets.

To assess the quality of imputation, we can use the stripplot function from the mice package. Its main part is the formula: first, we place the two variables that we would like to plot against each other, in this case Height and Weight, on either side of the tilde sign. Then, after the vertical bar, we place .imp. This creates a grid of plots, one for every imputed data set. The last two parameters, pch and cex, are there just to make the plot look better: they control the dot marker type and the scaling, respectively.
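A minimal sketch of the calls described here might look like the following. The data frame nhanes_sim is a simulated, hypothetical stand-in for the lesson's nhanes data, and the seed, the pch and cex values, and the order of Weight and Height in the formula are assumptions made for illustration.

```r
library(mice)

set.seed(42)
n <- 200

# Hypothetical stand-in for the lesson's nhanes data: two numeric columns
height <- round(rnorm(n, mean = 170, sd = 10))
weight <- round(0.9 * height - 85 + rnorm(n, sd = 8))
nhanes_sim <- data.frame(Height = height, Weight = weight)

# Remove some values at random so there is something to impute
nhanes_sim$Height[sample(n, 30)] <- NA
nhanes_sim$Weight[sample(n, 30)] <- NA

# 5 imputations with predictive mean matching as the default method;
# both columns are numeric, so the single string covers them
nhanes_multiimp <- mice(
  nhanes_sim,
  m = 5,
  defaultMethod = "pmm",
  seed = 3108,
  printFlag = FALSE
)

# One panel for the observed data plus one per imputed data set;
# pch sets the marker type and cex scales the marker size
imputation_plot <- stripplot(
  nhanes_multiimp,
  Weight ~ Height | .imp,
  pch = 20,
  cex = 2
)
print(imputation_plot)
```

The mice package also provides xyplot and densityplot methods that take a similar formula, which can be worth trying if a different view of the imputed values is easier to read.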

6. Strip plot

This is what the result looks like. It's basically a collection of scatter plots of Height versus Weight. The one in the top-left corner, without red dots, contains only observed data. The remaining five panels show imputed values in red. The imputation seems to be a good one, because the imputed values are close to the observed data: they aren't outliers and they don't break the relationship between weight and height.

7. Let's put what you've learned to practice!

Now you are ready to face the case study. Let's put what you've learned to practice!

