What makes a good imputation

1. What makes a good imputation

Imputing missing values needs to be done with care: you want to avoid imputing implausible values, such as pregnant males or winter temperatures on a warm summer's day.

2. Lesson overview

In this lesson we assess the features of good and bad imputations. You will learn how to evaluate imputed values by using visualizations to assess their key features: the mean/median, scale, and spread.

3. Understanding the good by understanding the bad

To understand good imputation, we must first understand bad imputation. One particularly bad method is mean imputation, which takes the mean of the complete values and uses it to fill in the missing values. For example, in a dataframe with five values, one of which is missing, we calculate the mean from the complete observations using na.rm = TRUE, and use this to impute the missing value.
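As a minimal sketch of the calculation just described, the base R snippet below imputes one missing value with the mean of the complete ones. The data frame and its values are made up for illustration and do not come from the lesson.

```r
# Hypothetical data: five values, one of them missing
df <- data.frame(x = c(1, 4, 9, 16, NA))

# Mean of the complete observations; na.rm = TRUE drops the NA first
x_mean <- mean(df$x, na.rm = TRUE)

# Fill the missing entry with that mean, keeping the original column intact
df$x_imputed <- ifelse(is.na(df$x), x_mean, df$x)

print(df)
```

Keeping the imputed values in a separate column makes it easy to see which entry was filled in.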

4. Demonstrating mean imputation

This is generally a terrible idea. For example, imputing the middle value in this graph with the mean gives the following: the mean does not respect the process that is going on in the data. Visualization shows this very clearly!

5. Explore bad imputations: The mean

To examine these bad imputations, we use the impute_mean function from the naniar package. Like impute_below, used in the previous lesson, impute_mean has scoped variants, so it can work on a vector (impute_mean), on variables that meet a condition such as being numeric (impute_mean_if), on specified variables (impute_mean_at), or on all variables (impute_mean_all).
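The scoped variants might be used as in the sketch below, which assumes the built-in airquality data and a naniar version that still exports the _if, _at, and _all forms (newer releases may steer you towards dplyr::across instead). The object names are illustrative.

```r
library(naniar)
library(dplyr)  # provides vars() for selecting columns

# A single vector
ozone_imp <- impute_mean(airquality$Ozone)

# Variables that meet a condition (here: numeric columns)
aq_if <- impute_mean_if(airquality, is.numeric)

# Specified variables only
aq_at <- impute_mean_at(airquality, vars(Ozone, Solar.R))

# All variables
aq_all <- impute_mean_all(airquality)

# Count the missing values left in each column after imputing everything
colSums(is.na(aq_all))
```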

6. Tracking missing values

To visualize imputations we use the same process as for impute_below. We first create nabular data by binding the shadow to track missing values. Then, we do our imputations. Finally, we add a label to identify cases with missing observations using add_label_shadow. One thing to keep in mind is the only_miss option of bind_shadow, which binds shadow columns only for variables with missing values. This makes the data a bit smaller and easier to handle.
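The three steps can be sketched as a single pipeline. This assumes the airquality data and uses impute_mean_all for the imputation step; aq_imp is an illustrative name.

```r
library(naniar)
library(dplyr)

aq_imp <- airquality %>%
  bind_shadow(only_miss = TRUE) %>%  # add *_NA shadow columns for variables with missings
  impute_mean_all() %>%              # impute every variable with its mean
  add_label_shadow()                 # add any_missing to flag rows that had a missing value

glimpse(aq_imp)
```

Binding the shadow before imputing matters: once the values are filled in, the shadow columns are the only record of which ones were originally missing.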

7. Exploring imputations using a box plot

Now that we know how to impute our data, let's explore it. We can explore the imputed values in the same way we did in the previous lesson. This time our intention is different: we want to evaluate the imputations by looking for changes in the mean, the spread, and the scale. We can evaluate changes in the mean or median using a box plot.

8. Visualizing imputations using the box plot

To visualize the Ozone data in a box plot, we put the missingness of Ozone, Ozone_NA, on the x axis and the values of Ozone on the y axis, and use geom_boxplot. This visualization shows us that the median is similar in each group, although it is lower for the not missing group. The takeaway is that a measure of average, like the mean, isn't changing much. This is good, but there is more than one feature to explore!
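A sketch of this box plot follows, assuming the airquality data. It repeats the mean-imputation pipeline so that it runs on its own; aq_imp and p_box are illustrative names.

```r
library(naniar)
library(dplyr)
library(ggplot2)

# Track missingness, impute with the mean, and label rows with missing values
aq_imp <- airquality %>%
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all() %>%
  add_label_shadow()

# Missingness of Ozone on the x axis, Ozone values on the y axis
p_box <- ggplot(aq_imp, aes(x = Ozone_NA, y = Ozone)) +
  geom_boxplot()

p_box
```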

9. Explore bad imputations using a scatter plot

The spread of imputations can be explored using a scatter plot. First, we pass ggplot our data, putting Ozone on the x axis and solar radiation, Solar.R, on the y axis, and coloring according to missingness, any_missing. This visualization shows that there is no variation in the spread of the imputed points! We do notice, however, that the imputed values are within a sane range of the data.
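The scatter plot might look like the sketch below, again assuming the airquality data and repeating the mean-imputation pipeline so that it runs on its own. The names aq_imp and p_scatter are illustrative.

```r
library(naniar)
library(dplyr)
library(ggplot2)

# Track missingness, impute with the mean, and label rows with missing values
aq_imp <- airquality %>%
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all() %>%
  add_label_shadow()

# Colour by any_missing to separate imputed rows from complete ones
p_scatter <- ggplot(aq_imp, aes(x = Ozone, y = Solar.R, colour = any_missing)) +
  geom_point()

p_scatter
```

Because every missing Ozone value receives the same mean, the imputed points share one x position, which is the lack of spread described above.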

10. Exploring imputations for many variables

To explore many variables, we use the shadow_long function to put the data into the long format we need. We pass in our data, followed by the variables that we want to focus on, in this case Ozone and Solar.R. This returns data with the columns variable and value, and the shadow columns variable_NA and value_NA.
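A minimal sketch of this reshaping step, assuming the airquality data with the shadow bound for all variables and mean imputation applied. The names aq_imp and aq_imp_long are illustrative.

```r
library(naniar)
library(dplyr)

# Nabular data with mean-imputed values
aq_imp <- airquality %>%
  bind_shadow() %>%
  impute_mean_all()

# Long format restricted to the two variables of interest
aq_imp_long <- shadow_long(aq_imp, Ozone, Solar.R)

head(aq_imp_long)
```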

11. Exploring imputations for many variables

We can then use this in a ggplot, placing value on the x axis, filling by the missingness of the value, value_NA, using geom_histogram, and faceting by variable.
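Put together, the faceted histogram could be sketched as follows. It assumes the airquality data and rebuilds the long data so that it runs on its own; the object names are illustrative.

```r
library(naniar)
library(dplyr)
library(ggplot2)

# Mean-imputed nabular data, reshaped to long format for two variables
aq_imp_long <- airquality %>%
  bind_shadow() %>%
  impute_mean_all() %>%
  shadow_long(Ozone, Solar.R)

# One histogram per variable, filled by whether the value was originally missing
p_hist <- ggplot(aq_imp_long, aes(x = value, fill = value_NA)) +
  geom_histogram() +
  facet_wrap(~variable)

p_hist
```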

12. Let's Practice!

Now that we can explore and assess our imputations, let's practice!