1. Hot-deck imputation
Welcome back! In this lesson, you will learn about a donor-based method called hot-deck imputation, which is a good alternative to mean imputation.
2. Hot-deck's history
The hot-deck imputation method dates back to the 1950s, when data were stored on punched cards, like the one in the picture. Because browsing back and forth through data stored in this format was extremely slow, an imputation method was needed that would require only a single pass through the data. And so, the U.S. Census Bureau came up with an idea.
3. Hot-deck imputation
Hot-deck imputation boils down to replacing every missing value with the last observed value of the same variable. In other words, we simply carry the last non-missing value forward. By the way, the term hot-deck refers to the deck of punched cards currently being processed.
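The carry-forward rule described above can be sketched in a few lines of plain Python. This is only an illustration of the single-pass idea, not the code behind any particular package; the function name `hot_deck` and the toy weights are made up for this example, and missing values are represented as `None`.

```python
def hot_deck(values):
    """Replace each missing value (None) with the last observed value.

    The data are read exactly once, front to back, which is the
    single-pass property that made the method practical on punched cards.
    """
    imputed = []
    last_seen = None  # most recent non-missing value, i.e. the current donor
    for value in values:
        if value is None:
            value = last_seen  # stays None if nothing has been observed yet
        else:
            last_seen = value
        imputed.append(value)
    return imputed


weights = [70.5, None, None, 82.0, None]  # hypothetical data
print(hot_deck(weights))  # [70.5, 70.5, 70.5, 82.0, 82.0]
```

Note that a value missing at the very start of the data has no donor yet, so this sketch leaves it missing.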
Hot-deck imputation has some drawbacks. It assumes that the data are missing completely at random (MCAR), and only under that assumption does it produce unbiased estimates of the missing values. In its vanilla form, it may also destroy relations between variables. On the other hand, it's fast, especially compared to the more complex methods that we'll cover in a later chapter. Unlike with mean imputation, the values imputed by hot-deck are not constant. And most importantly, the relation-breaking can be prevented with some small tricks. These tricks are what make hot-deck a great alternative to mean imputation.
4. Hot-deck imputation in practice
Let's see how it works in practice. Consider the NHANES health survey data, in which we would like to impute height and weight. To do this, we simply call the "hotdeck" function from the "VIM" package, specifying which variables to impute in the "variable" argument. Let's take a look at the imputed data. Notice the last two columns, "Height_imp" and "Weight_imp". They are binary indicators showing where the corresponding variables were imputed. In the previous lesson on mean imputation, you created them manually in order to visualize the imputed data. "hotdeck", like most other functions in the "VIM" package, takes care of this for you.
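To make the role of the indicator columns concrete, here is a rough Python sketch that imputes several variables in one pass and appends a "_imp" flag for each of them. It only mimics the column naming mentioned above; it is not the "VIM" implementation, and the function name `hot_deck_with_flags` and the three toy records are assumptions made for illustration.

```python
def hot_deck_with_flags(rows, variables):
    """Carry the last observed value forward for each listed variable and
    add a boolean '<variable>_imp' column marking the imputed cells."""
    last_seen = {}  # variable name -> most recent observed value
    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        for var in variables:
            was_imputed = False
            if row[var] is None:
                if var in last_seen:  # a donor value is available
                    new_row[var] = last_seen[var]
                    was_imputed = True
            else:
                last_seen[var] = row[var]
            new_row[var + "_imp"] = was_imputed
        result.append(new_row)
    return result


# Hypothetical records; None marks a missing measurement.
people = [
    {"Height": 180, "Weight": 80},
    {"Height": None, "Weight": 65},
    {"Height": 165, "Weight": None},
]
for row in hot_deck_with_flags(people, ["Height", "Weight"]):
    print(row)
# {'Height': 180, 'Weight': 80, 'Height_imp': False, 'Weight_imp': False}
# {'Height': 180, 'Weight': 65, 'Height_imp': True, 'Weight_imp': False}
# {'Height': 165, 'Weight': 65, 'Height_imp': False, 'Weight_imp': True}
```

Keeping the flags next to the data makes it easy to color or filter the imputed points later, for instance when plotting them.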
5. Imputing within domains
As we have mentioned, hot-deck in its vanilla form may break relations between variables.
Consider this example: we might expect physically active people to have, on average, lower weight than those who are not active.
However, if active and inactive people are mixed in the dataset, hot-deck can feed an inactive person's weight forward to an active person, destroying the relation between weight and physical activity.
A simple solution is to impute within domains, that is, separately for active and inactive people. This way, each active person will receive a value from a donor who is also active, and each inactive person from an inactive donor. To implement this trick, it's enough to pass one more argument to the "hotdeck" function, called "domain_var", and set it to "PhysActive".
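The idea of imputing within domains can be sketched by remembering one donor per domain instead of a single global donor. The Python below is an illustration under that assumption, not the "VIM" code; the name `hot_deck_by_domain` and the toy data are hypothetical.

```python
def hot_deck_by_domain(values, domains):
    """Carry the last observed value forward separately within each domain."""
    last_seen = {}  # domain label -> most recent observed value in that domain
    imputed = []
    for value, domain in zip(values, domains):
        if value is None:
            value = last_seen.get(domain)  # None if the domain has no donor yet
        else:
            last_seen[domain] = value
        imputed.append(value)
    return imputed


# Hypothetical data: active and inactive people are mixed in the file.
weights = [68, 90, None, None]
phys_active = ["Yes", "No", "Yes", "No"]
print(hot_deck_by_domain(weights, phys_active))  # [68, 90, 68, 90]
```

Without the domain split, plain carry-forward on the same weights would give [68, 90, 90, 90], handing the inactive person's 90 to the active person in the third row.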
6. Sorting by correlated variables
Imputing within domains is only possible when a categorical variable specifies the domains. But what if the variable to be imputed is correlated with a continuous variable?
For instance, weight and height tend to be positively correlated. If we run the vanilla hot-deck, we might impute a short person's weight with a tall person's weight.
To avoid this, we can first sort the data by height when imputing weight. This way, every missing weight value will be replaced with a value coming from a donor of similar height.
To do this, we simply need to pass the "ord_var" argument to "hotdeck", telling the function how to order data before performing imputation.
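Sorting before imputing can be sketched as follows: walk through the rows in order of the correlated variable, carry the last observed value forward, and write the results back to the original row positions. This is an illustrative Python sketch rather than the "VIM" implementation; the name `hot_deck_ordered` and the data are made up, and the ordering variable is assumed to have no missing values.

```python
def hot_deck_ordered(values, order_by):
    """Carry the last observed value forward after sorting rows by order_by.

    The returned list is in the original row order.
    """
    # Row indices arranged from the smallest to the largest order_by value.
    order = sorted(range(len(values)), key=lambda i: order_by[i])
    imputed = list(values)
    last_seen = None  # donor: nearest preceding row in the sorted order
    for i in order:
        if imputed[i] is None:
            imputed[i] = last_seen
        else:
            last_seen = imputed[i]
    return imputed


# Hypothetical data: two tall and two short people, interleaved.
heights = [188, 162, 160, 190]
weights = [92, None, 55, None]
print(hot_deck_ordered(weights, heights))  # [92, 55, 55, 92]
```

In file order, plain carry-forward would produce [92, 92, 55, 55], giving the 162 cm person a weight of 92 and the 190 cm person a weight of 55. After sorting by height, each missing weight comes from the nearest shorter donor instead.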
7. Let's practice hot-deck-imputing!
Let's practice hot-deck-imputing!