Choosing the number of neighbors
k-Nearest-Neighbors (kNN) imputation fills in the missing values of an observation using the values of the k other observations that are most similar to it. The number of these similar observations, called neighbors, is a parameter that has to be chosen beforehand.
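To make the idea concrete, here is a minimal base-R sketch of kNN imputation for a single variable. The helper knn_impute_one and the toy data are hypothetical and written only for illustration. The sketch assumes plain absolute distance on one predictor and aggregates the donors with the median; a real implementation will typically use a more general distance over several variables.

```r
# Toy data: y has one missing value; x is fully observed.
toy <- data.frame(
  x = c(1, 2, 3, 4, 10),
  y = c(10, 20, NA, 40, 100)
)

# Hypothetical helper (not from any package): impute NAs in `target`
# with the median of the k nearest donors, where "nearest" is measured
# by absolute distance on the `predictor` column.
knn_impute_one <- function(df, target, predictor, k) {
  missing_rows <- which(is.na(df[[target]]))
  donor_rows <- which(!is.na(df[[target]]))
  for (i in missing_rows) {
    # Distance from row i to every row with an observed target value
    d <- abs(df[[predictor]][donor_rows] - df[[predictor]][i])
    # Row indices of the k closest donors (or all donors if fewer than k)
    nearest <- donor_rows[order(d)[seq_len(min(k, length(donor_rows)))]]
    # Aggregate the donors' observed values into one imputed value
    df[[target]][i] <- median(df[[target]][nearest])
  }
  df
}

toy_imp <- knn_impute_one(toy, target = "y", predictor = "x", k = 2)
```

With k = 2, the two nearest donors for the third row are the rows with x = 2 and x = 4, so the missing y is filled with the median of 20 and 40, which is 30. A larger k pulls in more distant donors such as the row with x = 10, which is why the choice of k matters.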
How to choose k? One way is to try different values and see how each affects the relationship between the imputed and observed data.
Let's try imputing humidity in the tao data using three different values of k and see how the imputed values fit the relation between humidity and sea_surface_temp.
This exercise is part of the course Handling Missing Data with Imputations in R.
Have a go at this exercise by completing this sample code.
# Impute humidity using 30 neighbors
tao_imp <- ___(tao, k = ___, variable = ___)
# Draw a margin plot of sea_surface_temp vs humidity
tao_imp %>%
  select(sea_surface_temp, humidity, humidity_imp) %>%
  ___(delimiter = "imp", main = "k = 30")
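For reference, one way the comparison across several values of k might look is sketched below. It rests on assumptions that the template does not confirm: that the blanks stand for kNN() and marginplot() from the VIM package, that the tao data is VIM's bundled dataset with its columns renamed to the names used here, and that 5 and 15 are reasonable values to compare against 30 (the exercise only names 30).

```r
library(VIM)    # assumed source of kNN() and marginplot()
library(dplyr)  # for %>%, rename() and select()

# Assumption: VIM's bundled tao data, renamed to match the exercise's columns
data(tao, package = "VIM")
tao <- tao %>%
  rename(sea_surface_temp = Sea.Surface.Temp, humidity = Humidity)

# Illustrative values of k; only k = 30 appears in the exercise itself
for (k in c(5, 15, 30)) {
  # Impute humidity only; kNN() appends a logical humidity_imp indicator
  tao_imp <- kNN(tao, k = k, variable = "humidity")

  # Margin plot of sea_surface_temp vs humidity, imputed points highlighted
  tao_imp %>%
    select(sea_surface_temp, humidity, humidity_imp) %>%
    marginplot(delimiter = "imp", main = paste("k =", k))
}
```

Comparing the three plots side by side shows whether the imputed humidity values follow the pattern of the observed points or drift toward the center of the data as k grows.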