1. k-Nearest-Neighbors imputation
Welcome back! In this lesson, we will discuss k-Nearest-Neighbors imputation.
2. k-Nearest-Neighbors imputation
Imagine we have a data set of three variables: A, B and C. There is a missing value in A that we would like to impute.
3. k-Nearest-Neighbors imputation
In k-Nearest-Neighbors (or kNN) imputation, to impute an incomplete observation, we look for a chosen number, k, of other observations, or neighbors, that are most similar to it. Here, we have picked k=3 neighbors, marked in green.
4. k-Nearest-Neighbors imputation
Then, we replace the missing values with the aggregated values from the k donors. Here, we have replaced the missing value in variable A with the mean of the donors' values for A: the mean of 42, 33 and 16 is 30.33.
The question is how to choose the most similar donors. To do so, we need a measure of similarity, or distance, between observations.
5. Distance measures
The way we measure distance between two observations, say a and b, depends on the types of variables involved. For numeric variables, we use the Euclidean distance: we subtract the values of the two observations, square the differences, sum them across all numeric variables and finally take the square root. This corresponds to the shortest line between two points in 2D space. For ordered factors, we use the Manhattan distance: the absolute value of the difference between the observations' level ranks, summed across all such variables. This means that we only look at how many levels apart the two values are. Finally, for categorical variables, we use the Hamming distance, which is zero if the categories in a variable match and one otherwise. These ones and zeros are then summed across all categorical variables.
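To make these components concrete, here is a minimal base-R sketch that computes each distance piece for two made-up observations; the variable names and values are purely illustrative and are not part of the lesson's data.

```r
# Two toy observations: one numeric variable, one ordered factor (coded by
# its level rank), and one categorical variable
a <- list(age = 35, education = 2, gender = "male")
b <- list(age = 41, education = 4, gender = "female")

# Euclidean component for the numeric variable: square root of the squared difference
d_euclidean <- sqrt((a$age - b$age)^2)

# Manhattan component for the ordered factor: absolute difference in level ranks
d_manhattan <- abs(a$education - b$education)

# Hamming component for the categorical variable: 0 if the categories match, 1 otherwise
d_hamming <- as.numeric(a$gender != b$gender)

c(euclidean = d_euclidean, manhattan = d_manhattan, hamming = d_hamming)
```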
6. Gower distance
Often, our data has many types of variables. What to do then?
7. Gower distance
We simply compute the Euclidean distance for numeric variables, the Manhattan distance for ordered factors and the Hamming distance for categorical variables, and then combine them into an aggregated measure called the Gower distance. Don't worry, you won't have to do this yourself: VIM's functions will do all the calculations for you. Let's see how it works.
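As a side note not covered in the lesson, the cluster package's daisy() function also computes Gower dissimilarities on mixed-type data; the tiny data frame below is made up just to show what the combined measure looks like.

```r
library(cluster)

# A toy data frame mixing the three variable types
toy <- data.frame(
  A = c(42, 33, 16, 28),                                         # numeric
  B = factor(c("low", "mid", "high", "mid"),
             levels = c("low", "mid", "high"), ordered = TRUE),  # ordered factor
  C = factor(c("yes", "no", "yes", "yes"))                       # categorical
)

# Pairwise Gower dissimilarities between all observations
daisy(toy, metric = "gower")
```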
8. kNN imputation in practice
Let's demonstrate k-Nearest-Neighbors imputation on the nhanes data. We load the VIM package and call the "kNN" function. We need to specify the number of neighbors to use, k, and the variables to be imputed, here "TotChol" and "Pulse". Let's look at the imputed data. Notice again that the function outputs the binary indicators for the imputed variables.
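The slide code itself is not reproduced in this transcript, but a minimal sketch of the call could look like this, assuming the nhanes data frame from the course is already loaded; k = 5 is VIM's default and is used here only for illustration.

```r
library(VIM)

# Impute TotChol and Pulse using the k nearest neighbors
nhanes_imp <- kNN(nhanes,
                  variable = c("TotChol", "Pulse"),  # variables to impute
                  k = 5)                             # number of neighbors

# The result contains the data plus binary indicator columns
# (TotChol_imp, Pulse_imp) flagging which values were imputed
head(nhanes_imp)
```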
9. Weighting donors
Out of the k neighbors for an observation, some are more similar to it than others. We might want to put more weight on closer neighbors when aggregating their values.
One way to do this is to aggregate with a weighted mean, with inverted distances as weights. The smaller the distance, the more similar the neighbor and the larger the weight. Of course, a weighted mean can only work for numeric variables.
To implement this idea, we'll have to make two adjustments to the code from the previous slide. We set the numFun argument, the function used to aggregate numeric variables, to "weighted.mean", and set the weightDist argument to TRUE to use the distances as weights.
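Under the same assumptions as before (nhanes loaded, k = 5 chosen for illustration), the adjusted call could be sketched as:

```r
# Same call as before, but aggregate numeric donor values with a
# distance-weighted mean
nhanes_imp <- kNN(nhanes,
                  variable = c("TotChol", "Pulse"),
                  k = 5,
                  numFun = weighted.mean,  # aggregation function for numeric variables
                  weightDist = TRUE)       # use inverted distances as weights
```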
10. Sorting variables
Another important topic to discuss when talking about kNN imputation is the order of variables. The kNN algorithm loops over variables, imputing them one by one. Each time, the distances between observations are recalculated. If the first variable has a lot of missing values, then the distance calculations for the second variable will be based on many imputed values, which were estimated with some noise. It is, therefore, good to sort the variables in ascending order by the number of missing values before running kNN. This way, all distance calculations will use as little imputed data as possible.
11. Sorting variables in practice
To implement this idea, we will first get the names of the variables sorted in ascending order by the number of missing values they contain. We compute the number of missing values per variable as usual, then sort, setting decreasing to FALSE, and finally extract names. Having done this, we simply select the sorted variables to reorder them before feeding the data to the "kNN" function.
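A hedged sketch of those two steps, assuming nhanes is loaded and the dplyr pipe is available (the exact code shown on the slide may differ):

```r
library(dplyr)
library(VIM)

# Variable names sorted in ascending order of their missing-value counts
vars_by_NAs <- nhanes %>%
  is.na() %>%
  colSums() %>%
  sort(decreasing = FALSE) %>%
  names()

# Reorder the columns accordingly before imputing, so each distance
# calculation relies on as little previously imputed data as possible
nhanes_imp <- nhanes %>%
  select(all_of(vars_by_NAs)) %>%
  kNN(k = 5)
```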
12. Let's practice kNN imputation!
Let's practice kNN imputation!