Visualizing missing data patterns

1. Visualizing missing data patterns

Welcome back! In this final lesson of Chapter 1, we will look at how to detect missing data mechanisms with visualizations. Let's dive in!

2. Problems with the testing approach

Using statistical tests to detect missing data patterns, as you did in the previous lesson, is a great approach, but it comes with some problems. First, it can get cumbersome if you want to test for many relations between multiple variables. Second, the t-test, like all parametric tests, comes with assumptions about the data that may not hold in reality. Finally, inferences based on p-values are prone to problems such as the arbitrary choice of a significance level, or p-hacking, which means conducting a lot of tests, thus increasing the probability that some of them will turn out significant by chance, and then relying on the outcomes of these selected few.
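To make the last point concrete, here is a minimal Python sketch of how quickly the chance of a spurious "significant" result grows with the number of tests. It assumes the tests are independent and that there is no real effect in any of them; both are simplifying assumptions, and the function name is made up for this illustration.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n_tests tests.

    Assumes independent tests and that every null hypothesis is true.
    """
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 20):
    # With alpha = 0.05: about 0.05, 0.226 and 0.642 respectively.
    print(n, round(familywise_error_rate(n), 3))
```

Under these assumptions, running 20 tests at the 5% level gives roughly a 64% chance that at least one of them comes out significant purely by chance.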

3. Visualizing missing data

Another approach is to use visualizations. They are easy to use and allow you not only to detect missing data patterns but also to gain insights into other aspects of data quality. How do you plot data that are not there, you might ask? The "VIM" package is the answer. It has a great set of tools for plotting missing data. In this lesson, we will discuss three different types of plots: the aggregation plot, the spine plot and the mosaic plot. Let's discuss them one by one.

4. Aggregation plot

An aggregation plot answers the question: in which combinations of variables are the data missing, and how often? To draw the plot, we pass the "nhanes" data to the "aggr" function, setting the parameters "combined" and "numbers" to TRUE. The plot shows a grid that presents all combinations of missing (red) and observed (blue) values that occur across the variables. The bars to the right of the grid denote the percentage of observations with the corresponding pattern, while the bars on top show the missing percentage for each variable. From the bottom row, we see that for roughly 84% of the observations there are no missing data in any variable. The second row from the bottom tells us that almost 9% of the observations have a missing value only for total cholesterol.
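The numbers behind an aggregation plot can be sketched in a few lines. The lesson draws the plot with "aggr" in R; the following is only a plain Python illustration of the two quantities it displays, using a small made-up dataset (not the "nhanes" data) and hypothetical helper names.

```python
from collections import Counter

# Hypothetical toy records, not the course data; None marks a missing value.
rows = [
    {"Gender": "male", "TotChol": 4.9, "PhysActive": "Yes"},
    {"Gender": "female", "TotChol": None, "PhysActive": "No"},
    {"Gender": "male", "TotChol": 5.3, "PhysActive": None},
    {"Gender": "female", "TotChol": 4.1, "PhysActive": "Yes"},
    {"Gender": "male", "TotChol": None, "PhysActive": None},
]
columns = ["Gender", "TotChol", "PhysActive"]


def missing_patterns(rows, columns):
    """Share of rows for each missingness pattern (True = missing).

    Corresponds to the grid rows and the bars to the right of the grid.
    """
    counts = Counter(tuple(row[c] is None for c in columns) for row in rows)
    return {pattern: n / len(rows) for pattern, n in counts.items()}


def missing_share_per_variable(rows, columns):
    """Share of missing values in each variable (the bars on top)."""
    return {c: sum(row[c] is None for row in rows) / len(rows) for c in columns}


patterns = missing_patterns(rows, columns)
per_variable = missing_share_per_variable(rows, columns)
print(patterns)
print(per_variable)
```

In this toy data, the all-observed pattern (False, False, False) covers 40% of the rows, which plays the same role as the 84% bottom row described above.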

5. Spine plot

Another useful visualization is the spine plot, which allows us to study the percentage of missing values in one variable for different values of another. This is similar in spirit to the t-test from the previous lesson. To draw the spine plot, we first select two variables from the data: one according to which we want to split the data and one whose missing percentage we want to study. Then, we pass the selected variables to the "spineMiss" function. Here, we split by "Gender" and study the missing percentage of "TotChol". The relative width of the bar for each gender reflects its frequency: there are slightly more males in the dataset. Within each bar, the missing percentage of "TotChol" is shown. It seems to be roughly equal for both genders.
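As a rough illustration of what the spine plot encodes, the sketch below computes, for each group, the bar width (the group's share of all rows) and the missing share within the bar. It is plain Python on made-up records, not the "spineMiss" function itself, and the helper name is hypothetical.

```python
from collections import defaultdict

# Hypothetical toy records, not the course data; None marks a missing value.
rows = [
    {"Gender": "male", "TotChol": 4.9},
    {"Gender": "male", "TotChol": None},
    {"Gender": "male", "TotChol": 5.3},
    {"Gender": "female", "TotChol": None},
    {"Gender": "female", "TotChol": 4.1},
]


def spine_summary(rows, split_by, target):
    """Per group: relative bar width and share of missing target values."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[split_by]
        totals[group] += 1
        if row[target] is None:
            missing[group] += 1
    n = len(rows)
    return {
        g: {"width": totals[g] / n, "missing_share": missing[g] / totals[g]}
        for g in totals
    }


summary = spine_summary(rows, split_by="Gender", target="TotChol")
for group, stats in summary.items():
    print(group, stats)
```

Comparing the "missing_share" values across groups is the visual check the spine plot offers: similar shares suggest missingness in the target does not depend on the splitting variable.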

6. Mosaic plot

The final visualization in this lesson is the mosaic plot. It can be thought of as a generalization of the spine plot to more variables. This plot is a collection of tiles, where each tile corresponds to a specific combination of categories (for categorical variables) or bins (for numeric variables). Within each tile, the percentage of missing data points in another variable is shown. To draw the plot, we pass the data to the "mosaicMiss" function, with the variable whose missing percentage we want to study passed to the "highlight" argument and the variables to split by passed as a vector to the "plotvars" argument. Here, we look at the missing values in "TotChol" split by "Gender" and "PhysActive". The bottom right tile suggests that the most missing cholesterol values are for males with no data about physical activity.
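The same idea extends to tiles: group the rows by every combination of the splitting variables and compute the missing share of the studied variable in each group. The Python sketch below does this on made-up records; the function is hypothetical, and its argument names merely echo "plotvars" and "highlight" for readability. It assumes categorical splitting variables only (no binning of numeric ones) and treats a missing value in a splitting variable as its own category.

```python
from collections import defaultdict

# Hypothetical toy records, not the course data; None marks a missing value.
rows = [
    {"Gender": "male", "PhysActive": "Yes", "TotChol": 4.9},
    {"Gender": "male", "PhysActive": "Yes", "TotChol": None},
    {"Gender": "male", "PhysActive": None, "TotChol": None},
    {"Gender": "male", "PhysActive": None, "TotChol": None},
    {"Gender": "female", "PhysActive": "No", "TotChol": 4.1},
    {"Gender": "female", "PhysActive": "Yes", "TotChol": 5.0},
    {"Gender": "female", "PhysActive": None, "TotChol": 4.4},
]


def mosaic_summary(rows, plotvars, highlight):
    """Share of missing `highlight` values in each tile.

    A tile is one combination of values of the variables in `plotvars`.
    """
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        tile = tuple(row[v] for v in plotvars)
        totals[tile] += 1
        if row[highlight] is None:
            missing[tile] += 1
    return {tile: missing[tile] / totals[tile] for tile in totals}


tiles = mosaic_summary(rows, plotvars=["Gender", "PhysActive"], highlight="TotChol")
for tile, share in tiles.items():
    print(tile, share)
```

In this toy data the ("male", None) tile has the highest missing share, mimicking the pattern described for the bottom right tile above.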

7. Let's plot what's missing!

Let's plot what's missing!
