1. Performing and tracking imputation
Exploring missing data helps us understand the data and confirm that it is suitable for analysis.
Once we understand our data and the relationships between the variables and the missingness, it is a good idea to perform imputation so that we can carry out the analysis.
2. Lesson overview
In this lesson, we are going to focus on using imputations to understand data structure, and on visualizing and exploring imputed values. We will develop the following skills: imputing data and tracking missing values, and visualizing imputed values against data values.
Some of these techniques might look familiar, because we covered them in previous lessons.
This is one of the benefits of using naniar; the methods applied for exploring missing values are similar to those for exploring imputations.
3. Using imputations to understand data structure
In past lessons we have used geom_miss_point to explore missing values. This "shifted" the missing values below the range of the data so that we could see them.
In doing so, it actually performed some imputations!
We are going to recreate these visualizations using the impute_below function from naniar, which imputes values below the range of the data. For example, for a vector of the numbers 5 to 10 with one missing value, it imputes the value 4.4 in place of the missing value, since this is lower than the lowest value in the data, namely 5.
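As a minimal sketch of the vector example just described, the code below calls impute_below on a numeric vector. The vector name x is an assumption made for illustration. The exact imputed value depends on naniar's default shift and jitter settings, so the sketch only relies on it landing below the smallest observed value.

```r
library(naniar)

# A vector running from 5 to 10 with one missing value (hypothetical example data)
x <- c(5, 6, 7, NA, 9, 10)

# Replace the NA with a value slightly below the smallest observed value
x_imputed <- impute_below(x)
x_imputed
```

The observed values are left unchanged; only the position that held NA receives a new value.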
4. impute_below
impute_below has some useful variations that give us the flexibility to apply it to some or all of the variables in a data.frame.
impute_below_if only imputes variables that satisfy a condition, such as whether a column is numeric, which we can test with is.numeric.
impute_below_at imputes variables specified inside vars.
And impute_below_all will impute all variables.
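The three variations can be sketched side by side as follows. This is an illustration only: it assumes the built-in airquality dataset, and the object names aq_if, aq_at, and aq_all are made up for the example. The vars helper is taken from dplyr.

```r
library(naniar)

# Impute only the columns that satisfy a condition (here: numeric columns)
aq_if <- impute_below_if(airquality, is.numeric)

# Impute only the columns named inside vars() (here: Ozone)
aq_at <- impute_below_at(airquality, dplyr::vars(Ozone))

# Impute every column
aq_all <- impute_below_all(airquality)

# Count the missing values that remain in each result
colSums(is.na(aq_at))
colSums(is.na(aq_all))
```

With impute_below_at, Solar.R keeps its missing values because it was not named inside vars.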
5. Tracking missing values
We need to track the missing values once we impute them. Otherwise, we don't know what was imputed and what was not.
We can see that in our example, once we impute the data, we have no way to recognize which value was imputed.
6. Tracking missing values
We can identify missings by using bind_shadow to turn the data into nabular form.
Now when we impute the data, we can see that the shadow variable, var1_NA, reveals the imputed value, 4.40.
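The difference between untracked and tracked imputation can be sketched as below. The data frame df and its column var1 follow the narration; the object names df_untracked and df_tracked are assumptions for the example.

```r
library(naniar)

# Small example data frame with one missing value in var1
df <- data.frame(var1 = c(5, 6, 7, NA, 9, 10))

# Imputing directly: nothing in the result marks which value was imputed
df_untracked <- impute_below_all(df)

# Converting to nabular form first adds the shadow column var1_NA,
# which records "NA" for values that were missing and "!NA" otherwise
df_tracked <- impute_below_all(bind_shadow(df))
df_tracked
```

Because the shadow column is created before imputation, it still marks the fourth row as missing after the value has been filled in.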
7. Visualize imputed values against data values using histograms
Using this imputed data, we can explore the number of missings in a single variable, along with its distribution, using a histogram and coloring the missings by setting fill = Ozone_NA.
Here the missing values appear as two bars with counts of around 20 each, so just under 40 values are missing.
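A sketch of this histogram, assuming the imputed data is built from the built-in airquality dataset and stored under the illustrative name aq_imp:

```r
library(naniar)
library(ggplot2)

# Track missingness with shadow columns, then impute below the range
aq_imp <- impute_below_all(bind_shadow(airquality))

# Histogram of Ozone, with imputed (previously missing) values filled by colour
p_hist <- ggplot(aq_imp, aes(x = Ozone, fill = Ozone_NA)) +
  geom_histogram()
p_hist

# Number of Ozone values that were missing before imputation
sum(as.character(aq_imp$Ozone_NA) == "NA")
```

The imputed values sit to the left of the observed distribution because they were placed below the range of the data.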
8. Visualize imputed values against data values using facets
We can take this same plot and visualize it across facets. For example, we can plot it by Month, which shows us that most missing values occur in month 6, which didn't have many high values of Ozone.
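Faceting by Month could look like the sketch below. It rebuilds the imputed airquality data so that it runs on its own; the name aq_imp is again an assumption.

```r
library(naniar)
library(ggplot2)

# Track missingness with shadow columns, then impute below the range
aq_imp <- impute_below_all(bind_shadow(airquality))

# Same histogram as before, drawn as one panel per month
p_month <- ggplot(aq_imp, aes(x = Ozone, fill = Ozone_NA)) +
  geom_histogram() +
  facet_wrap(~Month)
p_month
```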
9. Visualize imputed values using facets
We can split the plot according to the missingness of Solar Radiation by referring to it as Solar.R_NA.
This shows us that there aren't many missing values in Ozone when Solar Radiation is missing.
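Splitting the histogram by the missingness of Solar Radiation can be sketched in the same way, using the shadow column Solar.R_NA as the faceting variable. As before, aq_imp is an illustrative name and the data is the built-in airquality dataset.

```r
library(naniar)
library(ggplot2)

# Track missingness with shadow columns, then impute below the range
aq_imp <- impute_below_all(bind_shadow(airquality))

# One panel for rows where Solar.R was observed, one where it was missing
p_solar <- ggplot(aq_imp, aes(x = Ozone, fill = Ozone_NA)) +
  geom_histogram() +
  facet_wrap(~Solar.R_NA)
p_solar
```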
10. Visualize imputed values against data values using scatter plots
To visualize missings for two variables, we need to add a label that identifies whether there is a missing value in a column. The function add_label_shadow does this for us.
We have now recreated the same figure as geom_miss_point!
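The scatter plot described above might be built as in this sketch. It assumes the built-in airquality dataset, and relies on add_label_shadow adding a column named any_missing, which is then mapped to colour.

```r
library(naniar)
library(ggplot2)

# Add shadow columns, label rows with any missing value, then impute
aq_imp <- impute_below_all(add_label_shadow(bind_shadow(airquality)))

# Scatter plot of two variables, coloured by whether the row had a missing value
p_scatter <- ggplot(aq_imp, aes(x = Ozone, y = Solar.R, colour = any_missing)) +
  geom_point()
p_scatter
```

Rows with a missing Ozone value appear to the left of the observed points and rows with a missing Solar.R value appear below them, which mirrors the layout of geom_miss_point.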
11. Let's practice!
naniar uses a similar workflow for exploring missing values and imputations, which makes it easy to learn!
Now let's practice!