
kNN tricks & tips II: sorting variables

As the k-Nearest Neighbors algorithm loops over the variables in the data to impute them, it computes distances between observations using the other variables, some of which have already been imputed in previous steps. If the variables that come first in the data have many missing values, the later distance calculations rely heavily on imputed values rather than observed ones, which introduces noise into those distances.

For this reason, it is good practice to sort the variables in increasing order of their number of missing values before performing kNN imputation. This way, each distance calculation is based on as much observed data and as little imputed data as possible.
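To make the ordering idea concrete, here is a minimal base R sketch. The data frame `toy` and its columns `a`, `b` and `c` are hypothetical and exist only for illustration; they are not part of the tao data.

```r
# Hypothetical toy data: three columns with different amounts of missingness
toy <- data.frame(
  a = c(1, NA, NA, NA, 5),   # 3 missing values
  b = c(2, 4, 6, 8, 10),     # 0 missing values
  c = c(NA, 1, 2, 3, 4)      # 1 missing value
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))

# Column names ordered from fewest to most missing values
vars_sorted <- names(sort(na_counts))

# Reorder the columns so the most complete variables come first
toy_sorted <- toy[, vars_sorted]
names(toy_sorted)
# "b" "c" "a"
```

With this ordering, the column with the most gaps (`a`) is imputed last, after the more complete columns have been handled.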

Let's try this out on the tao data!

This exercise is part of the course

Handling Missing Data with Imputations in R


Exercise instructions

  • Calculate the number of missing values in each column of tao in the first part of the pipeline.
  • Then, sort the variables in increasing order of their number of missing values, extract their names and assign the result to vars_by_NAs.
  • Use select() to reorder the tao variables using the order saved in vars_by_NAs.
  • Perform k-Nearest Neighbors imputation on the reordered data and assign the result to tao_imp.
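The same sequence of steps can be sketched on a different dataset. The example below is only an illustration under stated assumptions: it uses R's built-in `airquality` data as a stand-in for tao, assumes the dplyr and VIM packages are installed, and introduces the name `aq_imp` purely for this sketch.

```r
library(dplyr)  # provides %>% and select()
library(VIM)    # provides kNN()

# Assumption: the built-in airquality data stands in for tao here
vars_by_NAs <- airquality %>%
  is.na() %>%                    # logical matrix: TRUE where a value is missing
  colSums() %>%                  # number of missing values per column
  sort(decreasing = FALSE) %>%   # fewest missing values first
  names()

# Reorder the columns, then impute with k-Nearest Neighbors
aq_imp <- airquality %>%
  select(all_of(vars_by_NAs)) %>%
  kNN(k = 5)

# The original columns should no longer contain missing values
colSums(is.na(aq_imp[, vars_by_NAs]))
```

By default, kNN() also appends logical indicator columns with the suffix `_imp` that mark which values were imputed, which is what a later call such as marginplot(delimiter = "imp") relies on.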

Interactive exercise

Complete the sample code to successfully finish this exercise.

# Get tao variable names sorted by number of NAs
vars_by_NAs <- tao %>%
  ___ %>%
  colSums() %>%
  sort(decreasing = ___) %>% 
  names()

# Reorder the tao variables and feed them to kNN imputation
tao_imp <- tao %>% 
  select(___) %>% 
  ___()

# Draw a margin plot of the two variables; the "imp" delimiter marks imputed values
tao_imp %>%
  select(sea_surface_temp, humidity, humidity_imp) %>%
  marginplot(delimiter = "imp")