Tools to explore missing data dependence

1. Missing Data Workflows: The Shadow matrix and Nabular data

We've covered how to create summaries and visualize missing values. However, we have not given much detail on how to link these summaries of missingness back to values in the data. In this chapter we are going to explore special data structures to facilitate working with missing data: the Shadow Matrix, and Nabular data. Let's take a look at some data:

2. An example

Let's imagine that we have some census data that contains two columns: income, and education. We note that there are some missing values in education.

3. What we are going to cover

If we look at the distribution of income, we see that it looks like most of the values are around 70 to 80 thousand dollars a year. But if we fill the distributions according to whether or not education is missing, we see that there is a distinct gap between the two. To help us build plots like this, and explore how values are related to missingness, we are going to need some special data structures. This chapter introduces the concepts of the shadow matrix and nabular data, and demonstrates how they can be used in analysis.
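A plot of this kind can be sketched as follows. The census data is not available here, so this example assumes the built-in airquality dataset as a stand-in: it shows the distribution of Temp, filled according to whether Ozone is missing. It relies on bind_shadow from the naniar package, which is covered later in this chapter, and on ggplot2 for plotting.

```r
library(naniar)
library(ggplot2)

# Attach missingness indicators (columns ending in "_NA") to the data
aq_nab <- bind_shadow(airquality)

# Density of Temp, with one filled curve per missingness state of Ozone
p <- ggplot(aq_nab, aes(x = Temp, fill = Ozone_NA)) +
  geom_density(alpha = 0.5)

p
```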

4. The shadow matrix

One way to look at your data is not as a dataframe of values, but as a dataframe of ones and zeros, each indicating whether a value is missing (1) or not (0). The shadow matrix is a clearer representation of this binary form of the data. You can convert your data to a shadow matrix using as_shadow.
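As a minimal sketch, here is the conversion applied to the built-in airquality dataset (chosen for illustration because it ships with R and contains missing values), assuming the naniar package is installed:

```r
library(naniar)

# Convert the data to its shadow matrix: same shape, but each cell
# records whether the original value was missing
shadow <- as_shadow(airquality)

head(shadow)
```

The result has one row per row of the original data and one column per original column.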

5. The shadow matrix

The shadow matrix has the following features: 1. Coordinated names: Variables in the shadow matrix have the same names as in the data, with the suffix "_NA". This makes a variable's missingness straightforward to refer to, and signals a shift in thinking from "what are this variable's values" to "what is the missingness of this variable". 2. Clear values: The values are either !NA ("not missing") or NA ("missing"). This is clearer than ones and zeros for missing/not missing.
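Both features can be checked directly. The sketch below again assumes the airquality dataset and the naniar package; it inspects the column names and the levels of one shadow column, then counts how many values fall in each level.

```r
library(naniar)

shadow <- as_shadow(airquality)

# Coordinated names: each original name with "_NA" appended
names(shadow)

# Clear values: each shadow column is a factor with levels "!NA" and "NA"
levels(shadow$Ozone_NA)

# Count of not-missing versus missing Ozone values
table(shadow$Ozone_NA)
```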

6. Creating nabular data

To get the most out of the shadow matrix, it needs to be attached, column-wise, to the data. This can be done with bind_shadow(data). Data in this form is referred to as nabular data, so called because it is a portmanteau of "NA" and "tabular". You can also use nabular instead of bind_shadow, if you like. So here we have the income and education values, and then their shadow representations: income_NA and education_NA.
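A short sketch of both functions, assuming the naniar package and using airquality in place of the census data (which is not available here):

```r
library(naniar)

# Bind the shadow matrix column-wise to the original data
aq_nab <- bind_shadow(airquality)

# nabular() is an alternative name for the same operation
aq_nab2 <- nabular(airquality)

# Original columns come first, followed by their "_NA" counterparts
names(aq_nab)
```

The nabular data has twice as many columns as the original: one value column and one shadow column per variable.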

7. Using nabular data to perform summaries

Now that you can create nabular data, let's use it to do something useful, like calculate summary statistics based on the missingness of something else. We take the airquality data, then use bind_shadow to turn the data into nabular data. Note that we have the airquality variables, Ozone, Solar.R, etc., and the shadow matrix data, Ozone_NA, Solar.R_NA, and so on.

8. Using nabular data to perform summaries

We then perform some summaries on the data using group_by and summarize to calculate the mean of Wind speed, according to the missingness of Ozone. We see that the mean values of Wind are relatively similar, but slightly higher when Ozone is missing than when Ozone is not missing.
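The summary described above can be sketched like this, assuming the naniar and dplyr packages are installed. The column name wind_mean is chosen here for illustration.

```r
library(naniar)
library(dplyr)

wind_by_ozone_na <- airquality %>%
  bind_shadow() %>%                    # turn the data into nabular data
  group_by(Ozone_NA) %>%               # one group per missingness state of Ozone
  summarize(wind_mean = mean(Wind))    # mean Wind speed within each group

wind_by_ozone_na
```

The result has one row for Ozone_NA equal to !NA and one for NA, so the two means can be compared side by side.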

9. Let's practice!

Time to put this into practice!