
Evaluating imputations and models

1. Assessing inference from imputed data in a modelling context

Let's step back and think about why we are imputing data in the first place. The goal of imputation is to enable an analysis. In this lesson we discuss methods for assessing model inference across differently imputed datasets.

2. Exploring parameters of one model

Let's fit a linear model to the airquality dataset, predicting Temperature using Ozone, Solar radiation, Wind, Month, and Day. We are going to fit this model using two methods: 1. Complete case analysis, where we remove all rows that contain a missing value. 2. Imputation, using the linear model imputation from the last lesson.
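As a quick sketch, the complete case version of this model could be fit like so (variable names as they appear in the built-in airquality data, where Temp is temperature and Solar.R is solar radiation):

```r
# Fit the linear model on complete cases only:
# na.omit() drops every row containing a missing value
cc_fit <- lm(Temp ~ Ozone + Solar.R + Wind + Month + Day,
             data = na.omit(airquality))
summary(cc_fit)
```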

3. Combining the datasets together

There are three steps to comparing our data. First, we perform the complete case analysis, adding the shadow data so we have the same number of columns and the datasets can be bound together. Then, we impute the data according to a linear model. Finally, we combine the different datasets together. This prepares us for fitting our new models, so we can summarize and compare differences in the data.
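A minimal sketch of these three steps, using naniar for the shadow columns, simputation for the linear model imputation, and dplyr to bind the results; the exact imputation formulas here are illustrative assumptions:

```r
library(naniar)      # bind_shadow(), add_label_shadow()
library(simputation) # impute_lm()
library(dplyr)

# 1. Complete case analysis, keeping shadow columns so this
#    dataset has the same columns as the imputed version
aq_cc <- airquality %>%
  na.omit() %>%
  bind_shadow() %>%
  add_label_shadow()

# 2. Impute missing values with linear models
#    (these formulas are assumptions, for illustration)
aq_imp_lm <- airquality %>%
  bind_shadow() %>%
  add_label_shadow() %>%
  impute_lm(Ozone ~ Temp + Wind) %>%
  impute_lm(Solar.R ~ Temp + Wind)

# 3. Bind the datasets, labelling each with imp_model
bound_models <- bind_rows(cc = aq_cc,
                          imp_lm = aq_imp_lm,
                          .id = "imp_model")
```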

4. Combining the datasets together

The bound data has a column imp_model identifying which method produced each row, followed by the columns from airquality, our shadow variables, and any_missing.

5. Exploring the models

Now that we've got our data in the right format, we fit a linear model to each of the datasets. We use the "many models" approach, which is covered in more detail in the R for Data Science book by Hadley Wickham and Garrett Grolemund. This involves some functions that we haven't seen before. First, we group by the imputation model and then nest the data. This collapses, or nests, the data down into a neat format where each row holds one of our datasets. This allows us to create a linear model on each row of the data, using mutate and a special function, map, which applies a function to each nested dataset. We then fit the model and create separate columns for residuals, predictions, and coefficients, using the tidy function from broom to provide nicely formatted coefficients from our linear model. Our data, model_summary, has the columns imp_model and data, along with columns holding our fitted linear model (mod), residuals (res), predictions (pred), and tidy coefficients (tidy). model_summary forms the building block for the next steps in our analysis, where we are going to look at the coefficients, the residuals, and the predictions. This is just one way to fit these kinds of models - it might not work for all types of models, but it is convenient!
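The "many models" workflow described above might be sketched like this, assuming the combined data lives in a data frame called bound_models:

```r
library(dplyr)
library(tidyr)  # nest(), unnest()
library(purrr)  # map()
library(broom)  # tidy()

model_summary <- bound_models %>%
  group_by(imp_model) %>%
  nest() %>%                           # one row per dataset
  mutate(
    mod  = map(data,
               ~ lm(Temp ~ Ozone + Solar.R + Wind + Month + Day,
                    data = .)),        # fit a model on each row
    res  = map(mod, residuals),        # residuals per model
    pred = map(mod, predict),          # predictions per model
    tidy = map(mod, tidy)              # tidy coefficient tables
  )
```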

6. Exploring coefficients of multiple models

We explore the coefficients by selecting the imputation model and the tidy column, then unnesting. Here we see that the estimated effect of Ozone on temperature is slightly higher for the imputed dataset - note that the significance of the effect does not change.
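This step might look like the following, assuming the model_summary object built on the previous slide:

```r
# One row per coefficient, per imputation model
model_summary %>%
  select(imp_model, tidy) %>%
  unnest(tidy)
```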

7. Exploring residuals of multiple models

Let's explore the residuals by selecting imp_model and res, then unnesting the data. We can then create a histogram, using position = "dodge" to place the residuals for each model next to each other. Perhaps surprisingly, we see that there isn't much difference between the two.
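A sketch of the residual comparison, again assuming model_summary from earlier:

```r
library(ggplot2)

model_summary %>%
  select(imp_model, res) %>%
  unnest(res) %>%
  ggplot(aes(x = res, fill = imp_model)) +
  geom_histogram(position = "dodge")  # side-by-side bars per model
```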

8. Exploring predictions of multiple models

Finally, we can explore the predictions in the data using a similar pattern. As with the residuals, the predictions are quite similar to the complete case analysis, but with some more extreme values.
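And the same pattern applied to the predictions:

```r
library(ggplot2)

model_summary %>%
  select(imp_model, pred) %>%
  unnest(pred) %>%
  ggplot(aes(x = pred, fill = imp_model)) +
  geom_histogram(position = "dodge")
```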

9. Let's practice!

We have covered the basics of how to explore model features for different imputation methods. Now let's explore this in more detail with some practice exercises!
