Initializing missing values & iterating over variables

As you have just seen, running impute_lm() might not fill in all the missing values, because a regression model cannot predict for rows where one of its predictors is itself missing. To ensure you impute all of them, first initialize the missing values with a simple method, such as the hot-deck imputation you learned about in the previous chapter, which simply feeds forward the last observed value.
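To make the gap concrete, here is a minimal base-R sketch on an invented toy data frame (not the tao data). The helper fill_forward() is a hypothetical stand-in for the feed-forward idea described above, not the hotdeck() function itself, and lm()/predict() stand in for impute_lm().

```r
# Toy data, invented for illustration only
df <- data.frame(
  x = c(1, 2, NA, 4, 5, NA),
  y = c(2.1, NA, NA, 8.1, 9.8, 12.3)
)

# Fit a regression on the complete rows and predict for every row
fit <- lm(y ~ x, data = df)
pred <- predict(fit, newdata = df)

# Fill y only where it is missing; row 3 stays NA because x is missing there too
y_after_lm <- df$y
y_after_lm[is.na(df$y)] <- pred[is.na(df$y)]

# Hypothetical helper: carry the last observed value forward
# (a leading NA would stay NA, since there is nothing to carry)
fill_forward <- function(v) {
  for (i in seq_along(v)[-1]) {
    if (is.na(v[i])) v[i] <- v[i - 1]
  }
  v
}

# Initializing every column first leaves no gaps for the regression step
df_init <- data.frame(lapply(df, fill_forward))
```

Here anyNA(y_after_lm) is still TRUE after the regression step alone, while anyNA(df_init) is FALSE, which is why the initialization comes first.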

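Moreover, a single pass of imputation is usually not enough: the model is fitted using the crude initialized values of the other variables as predictors, so its predictions could be biased. A better approach is to iterate over the variables, re-imputing each one in turn, but only in the locations where it was originally missing, so that each pass works from the improved values of the previous one.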
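The iteration can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, the variable names z, a and b are made up, a column-mean fill stands in for the hot-deck initialization, and lm() with predict() stands in for impute_lm().

```r
set.seed(1)

# Simulated toy data with two related variables that contain gaps
n <- 50
z <- rnorm(n)
a <- 2 * z + rnorm(n, sd = 0.3)
b <- -1 * z + 0.5 * a + rnorm(n, sd = 0.3)
toy <- data.frame(z = z, a = a, b = b)
toy$a[c(3, 10, 17)] <- NA
toy$b[c(3, 8, 25)] <- NA

# Record where each variable was originally missing, before any filling
miss_a <- is.na(toy$a)
miss_b <- is.na(toy$b)

# Crude initialization (column mean as a simple stand-in for hot-deck)
toy$a[miss_a] <- mean(toy$a, na.rm = TRUE)
toy$b[miss_b] <- mean(toy$b, na.rm = TRUE)

for (i in 1:5) {
  # Fit on rows where a was observed, then overwrite only the originally missing a
  fit_a <- lm(a ~ z + b, data = toy[!miss_a, ])
  toy$a[miss_a] <- predict(fit_a, newdata = toy[miss_a, ])

  # Same for b, now using the freshly updated a as a predictor
  fit_b <- lm(b ~ z + a, data = toy[!miss_b, ])
  toy$b[miss_b] <- predict(fit_b, newdata = toy[miss_b, ])
}
```

Keeping the masks miss_a and miss_b fixed outside the loop is the key design choice: observed values are never overwritten, and only the originally missing cells are refined on each pass.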

In this exercise, you will first initialize the missing values with hot-deck imputation and then loop five times over air_temp and humidity from the tao data to impute them with linear regression. Let's get to it!

This exercise is part of the course

Handling Missing Data with Imputations in R

Exercise instructions

  • Initialize the missing values in tao with hotdeck() imputation and assign the result to tao_imp.
  • Create a boolean mask for where humidity was originally missing and assign it to missing_humidity.
  • Inside the for-loop, set the humidity in tao_imp to NA in places where it was originally missing using the boolean mask you have created.
  • Inside the for-loop, impute humidity in tao_imp with linear regression, using year, latitude, sea_surface_temp and air_temp as predictors and re-assign the result to tao_imp.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Initialize missing values with hot-deck
tao_imp <- ___(tao)

# Create boolean masks for where air_temp and humidity are missing
missing_air_temp <- tao_imp$air_temp_imp
missing_humidity <- ___

for (i in 1:5) {
  # Set air_temp to NA in places where it was originally missing and re-impute it
  tao_imp$air_temp[missing_air_temp] <- NA
  tao_imp <- impute_lm(tao_imp, air_temp ~ year + latitude + sea_surface_temp + humidity)
  # Set humidity to NA in places where it was originally missing and re-impute it
  tao_imp$humidity[___] <- ___
  tao_imp <- ___(___, ___ ~ year + latitude + sea_surface_temp + ___)
}