1. Mean imputation
Congratulations on making it past Chapter 1! Now that you can diagnose missing data mechanisms, it is time to look at the methods to fill in the missing values.
2. Imputation vocabulary
Let's kick off with some basic vocabulary. Imputation means making an educated guess about what the missing values might be.
There are two basic types of imputation. First, for each observation with incomplete data, we can fill in the missing values with data from other, complete observations. The complete observations that donate their data to the incomplete ones are called donors, and this family of methods is known as donor-based imputation.
Second, the missing values can be predicted with a statistical or machine learning model. This approach is referred to as model-based imputation.
The model-based methods are the topic of the next chapter. In this chapter, we will discuss three donor-based methods: mean, hot-deck and kNN imputation. Let's get to it!
3. Mean imputation
One of the simplest imputation methods is mean imputation. It boils down to replacing the missing values in each variable with the mean of its observed values. In the example, we take the mean of 2, 3, 2 and 5, which is 3, and use it as the replacement for the missing value.
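In base R terms, the idea is just two lines (a minimal sketch using the toy values from the slide):

```r
# Toy vector with one missing value
x <- c(2, 3, NA, 2, 5)

# Mean of the observed values: (2 + 3 + 2 + 5) / 4 = 3
m <- mean(x, na.rm = TRUE)

# Replace every missing value with that mean
x[is.na(x)] <- m
x  # 2 3 3 2 5
```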
Mean imputation can work well for time-series data that randomly fluctuate around some long-term average, such as stock price changes.
However, some practitioners treat mean imputation as a default go-to method for cross-sectional data as well, forgetting that it is often a very poor choice there. Mean imputation has two major drawbacks: it destroys the relations between variables and introduces no variance into the imputed data. Let's see what this means in the next slides.
4. Mean imputation in practice
Let's perform mean imputation on the "Height" and "Weight" variables from the NHANES data you have seen in the previous chapter.
First, we create binary indicators for whether each value was originally missing. They will be useful for visualizing the imputed data. The two mutate statements create two new variables with the suffix "_imp" that are TRUE wherever the corresponding value is missing and will be imputed.
Then, we replace the missing values with the means. The two mutate statements overwrite "Height" and "Weight" with their respective means if they are missing and leave them unchanged otherwise.
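The two steps might look like this in dplyr (a sketch; `nhanes` here is a tiny stand-in data frame, since the real NHANES data has many more rows and columns):

```r
library(dplyr)

# Toy stand-in for the NHANES data
nhanes <- data.frame(Height = c(160, NA, 175),
                     Weight = c(NA, 70, 82))

nhanes_imp <- nhanes %>%
  # Step 1: binary indicators flagging the values that will be imputed
  mutate(Height_imp = is.na(Height),
         Weight_imp = is.na(Weight)) %>%
  # Step 2: overwrite missing values with the column means,
  # leaving observed values unchanged
  mutate(Height = ifelse(is.na(Height), mean(Height, na.rm = TRUE), Height),
         Weight = ifelse(is.na(Weight), mean(Weight, na.rm = TRUE), Weight))
```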
5. Mean-imputed NHANES data
This is how the first few rows of the variables of interest look. The value "TRUE" for "Height_imp" in the first row indicates that the corresponding value of "Height", that is 166.2499, is the imputed mean and this value was originally missing. Notice that the same value was imputed in the second row. In fact, the average height in the data is 166.2499, and this value was inserted everywhere height was missing.
6. Assessing imputation quality: margin plot
A good way to assess the quality of imputation is to visualize the imputed values against the original data.
For two numeric variables, such as "Height" and "Weight", we can draw a margin plot. To do this, we select the two variables alongside the binary indicators we have created previously and pass them to the "marginplot" function from the VIM package.
We set the delimiter to "imp" to tell the function which suffix marks the binary indicators for imputed values.
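The selection and plotting step might look like this (a sketch; it requires the VIM package, and the small `nhanes_imp` data frame below stands in for the mean-imputed data from the previous step):

```r
library(dplyr)
library(VIM)

# Toy stand-in for the mean-imputed data with its "_imp" indicators
nhanes_imp <- data.frame(Height = c(160, 167.5, 175),
                         Weight = c(76, 70, 82),
                         Height_imp = c(FALSE, TRUE, FALSE),
                         Weight_imp = c(TRUE, FALSE, FALSE))

# Select the two variables plus their indicators and draw the margin plot
nhanes_imp %>%
  select(Height, Weight, Height_imp, Weight_imp) %>%
  marginplot(delimiter = "imp")
```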
The margin plot is basically a scatter plot of "Weight" versus "Height". The blue circles are values observed in both variables, while the orange ones are imputed.
We clearly see that the positive relation between these two variables has been totally destroyed in the imputed values.
Also, there is no variation whatsoever among the imputed data.
7. Troubles with mean imputation
You have just seen the two major drawbacks of mean imputation: destroyed relations between variables and lack of variability in imputed data. But why are they so bad?
Let's look at the former first. After mean-imputing `Height` and `Weight`, their positive correlation is weaker. Hence, models predicting one using the other will be fooled by the outlying imputed values and will produce biased results.
As far as the lack of variability is concerned: with less variance in the data, all standard errors will be underestimated. This prevents reliable hypothesis testing and calculation of confidence intervals.
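Both drawbacks can be seen numerically on a small simulated sample (toy numbers, not the NHANES data):

```r
set.seed(1)
height <- rnorm(100, mean = 166, sd = 10)
weight <- 0.5 * height + rnorm(100, sd = 5)   # positively correlated with height

# Knock out 30 heights at random, then mean-impute them
h_miss <- height
h_miss[sample(100, 30)] <- NA
h_imp <- ifelse(is.na(h_miss), mean(h_miss, na.rm = TRUE), h_miss)

# Drawback 1: the correlation with weight is weakened
cor(height, weight)   # on the full data
cor(h_imp, weight)    # after mean imputation: closer to zero

# Drawback 2: the variance shrinks, so standard errors are understated
sd(height)
sd(h_imp)             # smaller than sd(height)
```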
8. Median and mode imputation
A final remark: instead of the mean, one might impute with a median or a mode. Median imputation can be a better choice when there are outliers in the data. In such cases, the mean can be driven arbitrarily large or small by even a single outlier, while the median stays closer to most of the data. For categorical variables, we can compute neither the mean nor the median, so we use the mode instead. However, both median imputation (shown in the plot for our height and weight example) and mode imputation share the same drawbacks as mean imputation: no variance in the imputed values and broken relations between variables.
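Both variants follow the same pattern as mean imputation; only the summary statistic changes. Base R has no built-in mode function, so a small helper is needed (a sketch; the helper name `stat_mode` is made up):

```r
# Median imputation: robust to the outlier 500
x <- c(2, 3, NA, 2, 500)
x[is.na(x)] <- median(x, na.rm = TRUE)
x  # the imputed value is 2.5, not the outlier-driven mean of 126.75

# Mode imputation for a categorical variable
stat_mode <- function(v) {
  v <- v[!is.na(v)]
  names(which.max(table(v)))  # most frequent level
}
g <- c("a", "b", NA, "a")
g[is.na(g)] <- stat_mode(g)
g  # "a" "b" "a" "a"
```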
9. Let's practice!
Let's get our hands dirty with mean-imputing some data!