
Hyperparameters of KNN

1. Hyperparameters of KNN

In this video, we will cover a few more KNN hyperparameters along with how to tune them.

2. Modify evaluate_outlier_classifier

We will continue using the evaluate_outlier_classifier and evaluate_regressor functions with modifications. First, we change evaluate_outlier_classifier to use outlier probabilities for filtering so that we don't have to tune contamination. Remember that this function fits any pyod model to the given data and returns the inliers. We add threshold as a parameter to the function definition so we can use it inside the body when filtering for outliers.
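Here is a minimal sketch of how the modified function might look, assuming the data is a pandas DataFrame and relying on pyod's predict_proba to obtain outlier probabilities; the exact body shown in the video may differ slightly.

```python
def evaluate_outlier_classifier(model, data, threshold):
    # Fit the pyod model to the full dataset
    model.fit(data)

    # Column 1 of predict_proba holds each row's outlier probability
    probs = model.predict_proba(data)

    # Keep only the rows whose outlier probability is below the threshold
    return data[probs[:, 1] < threshold]
```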

3. Modifying evaluate_regressor

In evaluate_regressor, which evaluates Linear Regression on the data with RMSE, we only change the target name, which is the weight-in-kilograms column in this case.
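A sketch of what this function could look like follows; the target column name ("weightkg"), the split sizes, and the random seed are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def evaluate_regressor(inliers):
    # "weightkg" stands in for the weight-in-kilograms target column
    X = inliers.drop("weightkg", axis=1)
    y = inliers[["weightkg"]]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=10
    )

    lr = LinearRegression()
    lr.fit(X_train, y_train)
    preds = lr.predict(X_test)

    # RMSE: square root of the mean squared error
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    return round(rmse, 3)
```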

4. Tuning the number of neighbors

Now, when you aren't sure about the number of outliers in the dataset (which happens often), you can't use the rule of thumb that suggests using 20 neighbors when contamination is below 10%. In such cases, you'll have to tune `n_neighbors`. Let's start the process by creating a list called n_neighbors containing 5, 10, 15, and 20. We choose values of 20 and below because we know from the first video that the contamination is less than 1% for the males dataset. As always, we have an empty dictionary for scores as well. Inside the loop, we initialize KNN with the current k, run evaluate_outlier_classifier with a 55% threshold, and store the result of evaluate_regressor in scores.
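The loop could be sketched as below, assuming the dataset is stored in a DataFrame named males and reusing the two helper functions defined above.

```python
from pyod.models.knn import KNN

n_neighbors = [5, 10, 15, 20]
scores = dict()

for k in n_neighbors:
    # Initialize KNN with the current number of neighbors
    knn = KNN(n_neighbors=k)

    # Filter the dataset using a 55% outlier-probability threshold
    inliers = evaluate_outlier_classifier(knn, males, threshold=0.55)

    # Store the RMSE of Linear Regression on the remaining inliers
    scores[k] = evaluate_regressor(inliers)

print(scores)
```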

5. Inspecting the result

We see that 10 neighbors gives the lowest RMSE score.

6. Distance metrics

Once an optimal value for n_neighbors is determined, KNN calculates the distances between each data point and its neighbors. We already saw one way of doing this using the euclidean function from the scipy.spatial.distance module. In fact, there are more than 40 distance algorithms in scipy, and KNN accepts any of them via its metric parameter. Even though Euclidean distance is very popular, it does not work well with data beyond two or three dimensions, and we always have to scale the data before using it.
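For example, the scipy function can be called directly on two points, and the same metric can be requested from KNN by name; the sample coordinates below are purely illustrative.

```python
from scipy.spatial.distance import euclidean
from pyod.models.knn import KNN

# Straight-line distance between two 3-dimensional points
print(euclidean([1, 2, 3], [4, 6, 8]))  # ~7.07

# The same metric can be passed to KNN through its metric parameter
knn = KNN(metric="euclidean")
```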

7. Manhattan distance

After Euclidean, the most popular metric is Manhattan. To calculate it, we subtract the coordinates of A from those of B and take the absolute values of the differences rather than their squares. The distance is then the sum of those absolute values.
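A quick worked example with made-up points, using scipy's cityblock function, which implements Manhattan distance:

```python
from scipy.spatial.distance import cityblock

A = [1, 2, 3]
B = [4, 6, 8]

# |1 - 4| + |2 - 6| + |3 - 8| = 3 + 4 + 5 = 12
print(cityblock(A, B))  # 12
```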

8. Manhattan distance

Manhattan distance works reasonably well with high-dimensional data. It returns larger values than Euclidean because it measures the path between two vectors along right-angled segments only, so it is not the shortest path between them. Its best use case is data with many one-hot encoded categorical features: because the path between data points is built only from right-angled steps, Manhattan distance represents the distance between categorical features more realistically.

9. Minkowski distance

Since the formulas of Euclidean and Manhattan distances are similar, they are combined into a single formula called Minkowski distance, which is the default in pyod. Setting p to two gives Euclidean distance, and changing it to one gives Manhattan distance. p can also take values higher than two and is tuned when Euclidean or Manhattan do not yield satisfactory results.
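The relationship can be checked directly with scipy's minkowski function; the sample points and the p values below are illustrative.

```python
from scipy.spatial.distance import minkowski
from pyod.models.knn import KNN

A = [1, 2, 3]
B = [4, 6, 8]

# p=2 reproduces Euclidean distance, p=1 reproduces Manhattan distance
print(minkowski(A, B, p=2))  # ~7.07
print(minkowski(A, B, p=1))  # 12.0

# In pyod's KNN, the degree is controlled through the p parameter
knn = KNN(metric="minkowski", p=2)
```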

10. Distance aggregation

Once the distances between an instance and its k-nearest neighbors are calculated, they are aggregated into a single value via one of three methods. The first, called largest, chooses the distance to the farthest neighbor; this is the default behavior in KNN. If the method is mean or median, the arithmetic mean or the median of all k distances is taken instead.
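In code, the aggregation strategy is selected through KNN's method parameter; the n_neighbors value below is only an example.

```python
from pyod.models.knn import KNN

# "largest" (the default) scores each point by the distance to its
# farthest neighbor; "mean" and "median" aggregate all k distances
knn_largest = KNN(n_neighbors=10, method="largest")
knn_mean = KNN(n_neighbors=10, method="mean")
knn_median = KNN(n_neighbors=10, method="median")
```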

11. Tuning distance and method

Let's tune the distance metric and method parameters simultaneously. We create two lists: one for possible values of p, the degree of the Minkowski distance, and another for the aggregation methods. The loop is written in the same way as before; only the parameters and temporary values change.
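A sketch of the nested loop is shown below; the candidate p values, the fixed n_neighbors of 10, and the males DataFrame name are assumptions carried over from the earlier steps.

```python
from pyod.models.knn import KNN

ps = [1, 2, 3]
methods = ["largest", "mean", "median"]
scores = dict()

for p in ps:
    for method in methods:
        knn = KNN(n_neighbors=10, p=p, method=method)

        # Same workflow as before: filter inliers, then score the regressor
        inliers = evaluate_outlier_classifier(knn, males, threshold=0.55)
        scores[(p, method)] = evaluate_regressor(inliers)

print(scores)
```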

12. Inspecting the result

After tuning, we can see that three sets of hyperparameters return the lowest RMSE. If we had to choose one, we would go with Euclidean distance (p of two) combined with the largest method.

13. Let's practice!

Now, let's practice!