Missing Data dependence

1. Missing Data dependence

Having cleaned up our messy missing data, we need to work out what we're going to do with our missing values: should we delete them or impute them? Deleting and imputing values can both have serious implications for the decisions we later make from our data. To help us frame these choices, we need to discuss a concept from missing data theory: missing data dependence.

2. Outline

In this lesson, we discuss three flavors of missingness: MCAR - Missing Completely at Random; MAR - Missing At Random; and MNAR - Missing Not At Random.

3. MCAR: What is it?

Missing Completely at Random, or MCAR, is where the missingness has no association with any data, whether observed or unobserved. For example, test scores from a workplace may be missing for some workers who are on vacation. Now, if there is no relationship between the timing of the tests and the timing of vacations - that is, people aren't taking a vacation to avoid a test - and if the workers on vacation aren't too different from those who aren't, then these missing data can be considered MCAR.
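To make the vacation example concrete, here is a minimal simulation sketch. It is written in plain Python rather than R purely for illustration, and the function name `simulate_mcar`, the score distribution, and the 20% vacation rate are all assumptions, not values from the course. Because vacation is drawn independently of the score, the mean of the observed scores should sit close to the mean of all scores.

```python
import random
import statistics

def simulate_mcar(n=10_000, p_missing=0.2, seed=1):
    """Hypothetical workplace: a score goes missing independently of its value."""
    rng = random.Random(seed)
    scores = [rng.gauss(70, 10) for _ in range(n)]
    # The "vacation" draw ignores the score entirely: that is what makes it MCAR.
    observed = [s if rng.random() >= p_missing else None for s in scores]
    return scores, observed

scores, observed = simulate_mcar()
seen = [s for s in observed if s is not None]
print("mean of all scores:     ", round(statistics.mean(scores), 2))
print("mean of observed scores:", round(statistics.mean(seen), 2))
```

The two printed means should differ only by sampling noise, which is the practical signature of MCAR in this toy setting.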

4. MCAR: What are the implications?

So what does this mean? If your data is MCAR, you should impute, or fill in, the missing values. Deleting observations with missing values may be appropriate, since under MCAR it should not bias your results, but you should be careful - you can lose a lot of your data if you aren't paying attention. As a rule of thumb, do not delete unless doing so costs you less than 5% of your data. But really, you should be imputing your data.
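One way to check the 5% guideline before deleting anything is to count how many rows deletion would remove. The sketch below is a hypothetical Python helper (`fraction_lost_by_deletion` is not a function from the course or from any package), and the four workers are made-up illustration data.

```python
def fraction_lost_by_deletion(rows):
    """Share of rows that would be dropped if every row with a None were deleted."""
    if not rows:
        return 0.0
    incomplete = sum(1 for row in rows if any(v is None for v in row.values()))
    return incomplete / len(rows)

# Made-up records: one of the four workers has no test score.
rows = [
    {"worker": "a", "test_score": 71.0},
    {"worker": "b", "test_score": None},
    {"worker": "c", "test_score": 64.5},
    {"worker": "d", "test_score": 80.0},
]

loss = fraction_lost_by_deletion(rows)
print(f"{loss:.0%} of rows would be dropped")
print("within the 5% guideline:", loss < 0.05)
```

Here a quarter of the rows would be lost, well above the guideline, so imputing would be the safer choice.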

5. MAR: What is it?

Missing at Random, or MAR, is where missingness depends on data you have observed, but not on data you have not observed. Say, for example, that depression is recorded for every worker and test scores are more likely to be missing for workers with high depression; then the data can be considered MAR. MAR data means you should be imputing your data carefully. Deleting observations with missing values is not appropriate, as you will likely bias your results.
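The bias from deleting MAR data can be seen in a small simulation. This is a sketch in Python under invented assumptions: the group sizes, score distributions, and missingness probabilities are illustrative only, and `simulate_mar` is a hypothetical name. Depression is always recorded, and it alone drives whether a score goes missing.

```python
import random
import statistics

def simulate_mar(n=20_000, seed=2):
    """Hypothetical workers: missingness depends only on an observed flag."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        high_depression = rng.random() < 0.3  # always recorded
        score = rng.gauss(60 if high_depression else 75, 8)
        # Assumed missingness rates: 60% for high depression, 10% otherwise.
        p_missing = 0.6 if high_depression else 0.1
        observed = None if rng.random() < p_missing else score
        rows.append((high_depression, score, observed))
    return rows

rows = simulate_mar()
true_mean = statistics.mean(s for _, s, _ in rows)
complete_case_mean = statistics.mean(o for _, _, o in rows if o is not None)
print("true mean:         ", round(true_mean, 2))
print("complete-case mean:", round(complete_case_mean, 2))

# Within each depression group the observed scores are still representative.
gaps = {}
for group in (False, True):
    full = statistics.mean(s for d, s, _ in rows if d == group)
    seen = statistics.mean(o for d, _, o in rows if d == group and o is not None)
    gaps[group] = seen - full
    print("high depression" if group else "low depression ", "gap:", round(gaps[group], 2))
```

The complete-case mean comes out too high because high-depression workers are under-represented, while the within-group gaps stay near zero. That is why an imputation that uses the observed depression variable can recover what deletion distorts.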

6. MNAR: What is it?

Missing Not at Random, or MNAR, is where the missingness of the response is related to an unobserved value that is relevant to the assessment of interest. So if an association between test scores and depression is known to exist, but both test scores AND depression are missing for the same workers, so that no high depression scores are recorded, we could consider these data to be MNAR. It is important to recognize MNAR, as it introduces bias into the estimation of associations and parameters of interest.
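A rough Python sketch of this situation follows. The numbers and the name `simulate_mnar` are assumptions made for illustration: every high-depression worker has both measurements missing, and a few other workers drop out at random, so the recorded data never contains a high depression score.

```python
import random
import statistics

def simulate_mnar(n=20_000, seed=3):
    """Hypothetical workers: the variable driving missingness is itself unrecorded."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        high_depression = rng.random() < 0.3
        score = rng.gauss(60 if high_depression else 75, 8)
        # Assumption: high-depression workers never return either measurement;
        # everyone else skips both with probability 0.1.
        dropped = high_depression or rng.random() < 0.1
        recorded_depression = None if dropped else high_depression
        recorded_score = None if dropped else score
        rows.append((score, recorded_depression, recorded_score))
    return rows

rows = simulate_mnar()
true_mean = statistics.mean(s for s, _, _ in rows)
observed_mean = statistics.mean(r for _, _, r in rows if r is not None)
recorded_levels = {d for _, d, _ in rows if d is not None}

print("true mean:    ", round(true_mean, 2))
print("observed mean:", round(observed_mean, 2))
print("depression levels that were recorded:", recorded_levels)
```

The observed mean is biased upward, and unlike the MAR case, nothing in the recorded data reveals the depression variable that would be needed to correct it.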

7. Example: MCAR

Now we are going to cover some visualizations that show what certain missingness structures might look like. Looking at our data mt_cars, we have applied some clustering to the missingness, and we see that there is still a lot of noise in it. We can also try arranging by a few different variables, but the important thing to take away here is that a "random" or "noisy" looking pattern generally suggests the missingness is not strongly related to the other variables in our data. We could say that it is MCAR.

8. Example: MAR

We can do something similar for another dataset, oceanbuoys. Arranging by the variable year, we see that there is some definite clustering of missingness - this is a common symptom of data MAR.
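The idea of arranging rows by a variable to reveal clustering can be sketched without any plotting. The Python helper below, `missingness_map`, is hypothetical, and the six (year, value) records are made up for illustration; they are not taken from oceanbuoys. It prints `#` for a missing value and `.` for an observed one.

```python
def missingness_map(rows, value_index=1, sort_index=None):
    """Return one character per row: '#' if the value is missing, '.' otherwise."""
    if sort_index is not None:
        # sorted() is stable, so rows with equal keys keep their original order.
        rows = sorted(rows, key=lambda row: row[sort_index])
    return "".join("#" if row[value_index] is None else "." for row in rows)

# Made-up (year, measurement) records in arrival order.
records = [
    (1997, 27.1), (1993, None), (1997, 26.8),
    (1993, None), (1993, 24.9), (1997, None),
]

print("original order:  ", missingness_map(records))
print("arranged by year:", missingness_map(records, sort_index=0))
```

In the original order the missing values look scattered. Once the rows are arranged by year, most of them bunch together at the start, which is the kind of clustering the plot makes visible.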

9. Example: MNAR

Finally, here is some data MNAR. Here, we have our ocean data, but I have made the wind variables go missing according to a variable that I have then removed from the dataset - something that is now unobserved. In this case, we can see some very clear structure, but this is not always the case. It is important to remember that it can be very difficult to ascertain whether missingness is MCAR, MAR or MNAR. These visualizations are one way to explore missingness, but they are not definitive - we will cover some more useful methods later on in the course.

10. Let's practice!

Let's practice!


Dealing With Missing Data in R
