Tuning n_neighbors
n_neighbors is the most crucial parameter of KNN. When you are unsure about the number of outliers in the dataset (which happens often), you can't use the rule of thumb that suggests using 20 neighbors when contamination is below 10%.

In such cases, you'll have to tune n_neighbors. Practice the process on the transformed version of the females dataset from the last exercise. It has been loaded as females_transformed. The KNN estimator and the evaluate_outlier_classifier and evaluate_regressor functions are also loaded.
Here are the function bodies as reminders:
def evaluate_outlier_classifier(model, data, threshold=.75):
    model.fit(data)
    probs = model.predict_proba(data)
    inliers = data[probs[:, 1] <= threshold]
    return inliers
def evaluate_regressor(inliers):
    X, y = inliers.drop("weightkg", axis=1), inliers[['weightkg']]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, train_size=0.8)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    preds = lr.predict(X_test)
    rmse = mean_squared_error(y_test, preds, squared=False)
    return round(rmse, 3)
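To see this helper in action, here is a self-contained sketch of evaluate_regressor run on made-up data. The heightcm column and the coefficients below are invented for illustration, and the RMSE is computed via np.sqrt so the sketch also runs on recent scikit-learn versions, where the squared= keyword of mean_squared_error was removed:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def evaluate_regressor(inliers):
    # Separate features from the weightkg target
    X, y = inliers.drop("weightkg", axis=1), inliers[["weightkg"]]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=10, train_size=0.8
    )
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    preds = lr.predict(X_test)
    # sqrt of MSE, version-safe replacement for squared=False
    rmse = float(np.sqrt(mean_squared_error(y_test, preds)))
    return round(rmse, 3)

# Synthetic stand-in for the inliers of females_transformed
rng = np.random.default_rng(1)
height = rng.normal(165, 7, 500)
weight = 0.9 * height - 80 + rng.normal(0, 5, 500)
toy = pd.DataFrame({"heightcm": height, "weightkg": weight})

print(evaluate_regressor(toy))
```

Because the synthetic noise has a standard deviation of 5 kg, the reported RMSE should land near 5.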
This exercise is part of the course Anomaly Detection in Python.
Exercise instructions
- Create a list of possible values for n_neighbors in this order: 5, 10, 20.
- Instantiate a KNN model, setting the value of n_neighbors to the current k in the loop.
- Find the inliers using the evaluate_outlier_classifier function.
- Calculate RMSE with evaluate_regressor and store the result into scores with k as the key and RMSE as the value.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a list of values for n_neighbors
n_neighbors = [____, ____, ____]

scores = dict()
for k in n_neighbors:
    # Instantiate KNN with the current k
    knn = ____(____, n_jobs=-1)

    # Find the inliers with the current KNN
    inliers = ____(____, ____, .50)

    # Calculate and store RMSE into scores
    scores[____] = ____

print(scores)
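A filled-in version of the loop might look like the sketch below. Since pyod's KNN detector may not be installed everywhere, this sketch reproduces its idea with scikit-learn's NearestNeighbors: each point's outlier score is its distance to the k-th nearest neighbor, min-max scaled into [0, 1] to play the role of predict_proba. The helper signatures, the synthetic females_transformed data, and its heightcm column are stand-ins for the preloaded course objects:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def outlier_probs(data, k):
    """Distance to the k-th nearest neighbor, min-max scaled to [0, 1]."""
    nn = NearestNeighbors(n_neighbors=k + 1, n_jobs=-1).fit(data)
    dists, _ = nn.kneighbors(data)
    scores = dists[:, -1]  # column 0 is the self-distance, so take the last
    return (scores - scores.min()) / (scores.max() - scores.min())

def evaluate_outlier_classifier(data, k, threshold=0.75):
    # Keep rows whose scaled outlier score is at or below the threshold
    return data[outlier_probs(data, k) <= threshold]

def evaluate_regressor(inliers):
    X, y = inliers.drop("weightkg", axis=1), inliers[["weightkg"]]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=10, train_size=0.8
    )
    lr = LinearRegression().fit(X_train, y_train)
    preds = lr.predict(X_test)
    return round(float(np.sqrt(mean_squared_error(y_test, preds))), 3)

# Synthetic stand-in for females_transformed
rng = np.random.default_rng(0)
height = rng.normal(165, 7, 400)
weight = 0.9 * height - 80 + rng.normal(0, 5, 400)
females_transformed = pd.DataFrame({"heightcm": height, "weightkg": weight})

# Create a list of values for n_neighbors
n_neighbors = [5, 10, 20]

scores = dict()
for k in n_neighbors:
    # Find the inliers with the current k, then score the regression
    inliers = evaluate_outlier_classifier(females_transformed, k, 0.50)
    scores[k] = evaluate_regressor(inliers)

print(scores)
```

The k whose entry in scores has the lowest RMSE is the n_neighbors value that removed outliers most helpfully for the downstream regression.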