KNN imputation

1. KNN imputation

There are some problems with median imputation.

2. Dealing with missing values

It's very fast, but it can produce incorrect results if the input data has a systematic bias and is missing not at random. In other words, if there is a pattern in the data that leads to missing values, median imputation can miss it. It is therefore useful to explore other imputation strategies, particularly for linear models. (Tree-based models such as random forests tend to be more robust to the missing-not-at-random case.) One useful type of missing value imputation is k-nearest neighbors, or KNN, imputation. This is a strategy for imputing missing values based on other, "similar" non-missing rows. It tries to overcome the missing-not-at-random problem by inferring what a missing value would be, based on observations that are similar on the other, non-missing variables.
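As a minimal sketch of the idea, here is KNN imputation applied with caret's preProcess function to a small made-up data frame (the data and the choice of k are illustrative, not from the video):

library(caret)   # preProcess with "knnImpute" also needs the RANN package

df <- data.frame(
  x1 = c(1.0, 1.1, 5.0, 5.2, 4.9),
  x2 = c(2.0, 2.1, 9.8, NA, 10.1)   # one value missing
)

# "knnImpute" fills each NA from the k nearest rows, measured on the
# columns that are not missing; note that caret centers and scales the
# data as part of this step, so the output is on a standardized scale
pp <- preProcess(df, method = "knnImpute", k = 2)
imputed <- predict(pp, df)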

3. Example: missing not at random

Fortunately, the train function has a built-in method to do this. Let's make a dataset that has some missing-not-at-random data. We'll look at the mtcars dataset and pretend that smaller cars (those with a lower displacement) don't report their horsepower. In this case, using median imputation will be incorrect: since only medium- and large-sized cars report their horsepower, the median non-missing value for horsepower will be medium to large. This bias can lead to inaccurate models, as we'd be assuming the wrong horsepower value for the small cars.
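A sketch of this setup with median imputation follows. The displacement cutoff, the glm model, and the cross-validation settings are assumptions for illustration, not necessarily the exact choices used in the video:

library(caret)
data(mtcars)

X <- mtcars[, c("cyl", "disp")]
Y <- mtcars$mpg

# Pretend small cars (low displacement) don't report their horsepower;
# the 140 cubic inch threshold is chosen purely for illustration
X$hp <- mtcars$hp
X$hp[mtcars$disp < 140] <- NA

# Median imputation: every missing hp gets the same medium-to-large value
set.seed(42)
model_median <- train(
  x = X, y = Y,
  method = "glm",
  preProcess = "medianImpute",
  trControl = trainControl(method = "cv", number = 10)
)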

4. Example: missing not at random

Using KNN imputation is much better: it will use the displacement and number-of-cylinders variables to make an educated guess at the value of horsepower, tending to rely on the smaller cars with known horsepower to fill in the missing values. The resulting model is more accurate, with an RMSE of 3.56 versus 3.61 for the model that used median imputation, though it's a bit slower.
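The same fit with KNN imputation is a one-word change to the sketch above; the exact RMSE values you get will depend on the seed and fold assignments, so the numbers quoted in the video may differ slightly from a rerun:

# Same data, but missing hp values are now inferred from cars with
# similar cyl and disp rather than from the overall median
set.seed(42)
model_knn <- train(
  x = X, y = Y,
  method = "glm",
  preProcess = "knnImpute",
  trControl = trainControl(method = "cv", number = 10)
)

# Compare cross-validated RMSE for the two imputation strategies
min(model_median$results$RMSE)
min(model_knn$results$RMSE)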

5. Let’s practice!

Let's explore KNN imputation on some other datasets.