Initializing missing values & iterating over variables
As you have just seen, running impute_lm() might not fill in all the missing values. To ensure you impute all of them, you should first initialize the missing values with a simple method, such as the hot-deck imputation you learned about in the previous chapter, which simply feeds forward the last observed value.
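To see why a regression imputation can leave gaps, here is a minimal base-R sketch on made-up data. It does not use impute_lm() or hotdeck(): lm() and predict() stand in for the regression step, and fill_forward() is a hypothetical helper that imitates the "feed forward the last observed value" idea. The toy data frame df and its values are invented for illustration only.

```r
# Toy data: row 3 is missing both x and y, row 5 is missing only y.
df <- data.frame(
  x = c(1, 2, NA, 4, 5, 6),
  y = c(2.1, 3.9, NA, 8.1, NA, 12.0)
)

# Regress y on x using the complete rows, then predict y where it is missing.
missing_y <- is.na(df$y)
fit <- lm(y ~ x, data = df)
pred <- predict(fit, newdata = df[missing_y, ])

imputed_once <- df
imputed_once$y[missing_y] <- pred

# Row 3 is still NA: its predictor x is missing, so no prediction can be made.
print(imputed_once)

# Hypothetical helper (not part of any package): carry the last observed
# value forward. A leading NA would stay NA because nothing precedes it.
fill_forward <- function(v) {
  for (i in seq_along(v)[-1]) {
    if (is.na(v[i])) v[i] <- v[i - 1]
  }
  v
}

# Initialize every gap first, so each row has a full set of predictors.
df_init <- df
df_init$x <- fill_forward(df_init$x)
df_init$y <- fill_forward(df_init$y)
print(df_init)
```

Once every variable has been initialized this way, a later regression step has a complete set of predictors for every row.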
Moreover, a single imputation is usually not enough. Because the model is fit on the crude initialized values, the result could be biased. A proper approach is to iterate over the variables, imputing them one at a time in the locations where they were originally missing.
In this exercise, you will first initialize the missing values with hot-deck imputation and then loop five times over air_temp and humidity from the tao data to impute them with linear regression. Let's get to it!
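The overall pattern (initialize, then repeatedly reset and re-impute each variable where it was originally missing) can be sketched in base R on simulated data. This is an illustration under stated assumptions, not the exercise solution: the toy data frame, its columns z, a and b, and the fill_forward() helper are all hypothetical, and lm() with predict() stands in for impute_lm().

```r
# Hypothetical helper: carry the last observed value forward.
fill_forward <- function(v) {
  for (i in seq_along(v)[-1]) {
    if (is.na(v[i])) v[i] <- v[i - 1]
  }
  v
}

# Simulated data with two correlated variables that will contain gaps.
set.seed(1)
n <- 50
z <- rnorm(n)
a <- 2 * z + rnorm(n, sd = 0.3)
b <- -1 * z + 0.5 * a + rnorm(n, sd = 0.3)
toy <- data.frame(z = z, a = a, b = b)
toy$a[c(3, 10, 17)] <- NA
toy$b[c(10, 25, 40)] <- NA   # row 10 is missing both a and b

# Record where each variable was originally missing, before initializing.
missing_a <- is.na(toy$a)
missing_b <- is.na(toy$b)

# Initialize all gaps with a simple method.
toy_imp <- toy
toy_imp$a <- fill_forward(toy_imp$a)
toy_imp$b <- fill_forward(toy_imp$b)

for (i in 1:5) {
  # Reset a where it was originally missing, then re-impute it from z and b.
  toy_imp$a[missing_a] <- NA
  fit_a <- lm(a ~ z + b, data = toy_imp)
  toy_imp$a[missing_a] <- predict(fit_a, newdata = toy_imp[missing_a, ])

  # Do the same for b, now using the freshly imputed a as a predictor.
  toy_imp$b[missing_b] <- NA
  fit_b <- lm(b ~ z + a, data = toy_imp)
  toy_imp$b[missing_b] <- predict(fit_b, newdata = toy_imp[missing_b, ])
}

# No gaps remain, and the observed values were never modified.
print(colSums(is.na(toy_imp)))
```

Storing the masks before initialization is the key design choice: after the gaps are filled, the data alone no longer tell you which values were imputed.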
This exercise is part of the course
Handling Missing Data with Imputations in R
Exercise instructions
- Initialize the missing values with hotdeck() imputation.
- Create a boolean mask for where humidity was originally missing and assign it to missing_humidity.
- Inside the for-loop, set humidity in tao_imp to NA in places where it was originally missing, using the boolean mask you have created.
- Inside the for-loop, impute humidity in tao_imp with linear regression, using year, latitude, sea_surface_temp and air_temp as predictors, and re-assign the result to tao_imp.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Initialize missing values with hot-deck
tao_imp <- ___(tao)
# Create boolean masks for where air_temp and humidity are missing
missing_air_temp <- tao_imp$air_temp_imp
missing_humidity <- ___
for (i in 1:5) {
  # Set air_temp to NA in places where it was originally missing and re-impute it
  tao_imp$air_temp[missing_air_temp] <- NA
  tao_imp <- impute_lm(tao_imp, air_temp ~ year + latitude + sea_surface_temp + humidity)
  # Set humidity to NA in places where it was originally missing and re-impute it
  tao_imp$humidity[___] <- ___
  tao_imp <- ___(___, ___ ~ year + latitude + sea_surface_temp + ___)
}