Tuning the aggregation method
Once the optimal number of neighbors is found, it's time to tune the distance aggregation method. If n_neighbors is 10, each datapoint will have ten distance measurements to its nearest neighbors. KNN offers three methods to aggregate those distances: largest, mean, and median.
Find out which is best for the females_transformed dataset. The KNN estimator and the evaluate_outlier_classifier and evaluate_regressor functions are loaded for you.
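To see what changing the aggregation method does in practice, here is a minimal, self-contained sketch. It assumes the KNN detector comes from pyod (pyod.models.knn.KNN) and uses a small made-up toy array rather than the females_transformed data:
import numpy as np
from pyod.models.knn import KNN

# Toy 2D data with one obvious outlier appended at the end (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), [[8.0, 8.0]]])

# Fit the same detector with each of the three aggregation methods
for method in ["largest", "mean", "median"]:
    knn = KNN(n_neighbors=10, method=method)
    knn.fit(X)
    # decision_scores_ holds the aggregated neighbor distance for each point;
    # the last entry corresponds to the planted outlier
    print(method, round(knn.decision_scores_[-1], 2))
The "largest" method scores each point by its distance to its farthest neighbor, while "mean" and "median" average or take the middle of all ten distances, so the same point can receive noticeably different anomaly scores under each setting.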
Here are the function bodies as reminders:
def evaluate_outlier_classifier(model, data, threshold=.75):
    model.fit(data)
    probs = model.predict_proba(data)
    inliers = data[probs[:, 1] <= threshold]
    return inliers

def evaluate_regressor(inliers):
    X, y = inliers.drop("weightkg", axis=1), inliers[['weightkg']]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, train_size=0.8)

    lr = LinearRegression()
    lr.fit(X_train, y_train)
    preds = lr.predict(X_test)

    rmse = mean_squared_error(y_test, preds, squared=False)
    return round(rmse, 3)
Exercise instructions
- Loop over the product of n_neighbors and methods and instantiate KNN with temporary variables of k and m (see the sketch after this list).
- Find the inliers with the current KNN and a threshold of 50%.
- Calculate RMSE and store the result into scores with (k, m) as the key and RMSE as the value.
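The "product" in the first instruction most likely refers to itertools.product, which yields every (k, m) combination. A minimal sketch of what the loop would iterate over:
from itertools import product

n_neighbors = [5, 20]
methods = ['largest', 'mean', 'median']

# product yields all 2 x 3 = 6 pairs: (5, 'largest'), (5, 'mean'), ..., (20, 'median')
for k, m in product(n_neighbors, methods):
    print(k, m)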
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
n_neighbors = [5, 20]
methods = ['largest', 'mean', 'median']
scores = dict()
for k, m in ____:
    # Create a KNN instance
    knn = KNN(____, ____, n_jobs=-1)

    # Find the inliers with the current KNN
    inliers = ____

    # Calculate and store RMSE into scores
    scores[(k, m)] = ____
print(scores)
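For reference, here is one way the blanks could be filled in. This is a hedged sketch, not the official course solution; it assumes itertools.product is available and that females_transformed, KNN, evaluate_outlier_classifier, and evaluate_regressor are already loaded in the session:
from itertools import product

n_neighbors = [5, 20]
methods = ['largest', 'mean', 'median']
scores = dict()

for k, m in product(n_neighbors, methods):
    # Create a KNN instance with the current neighbor count and aggregation method
    knn = KNN(n_neighbors=k, method=m, n_jobs=-1)

    # Find the inliers with the current KNN and a 50% threshold
    inliers = evaluate_outlier_classifier(knn, females_transformed, threshold=0.5)

    # Calculate and store RMSE into scores, keyed by the (k, m) pair
    scores[(k, m)] = evaluate_regressor(inliers)

print(scores)
The (k, m) pair with the lowest RMSE is the strongest candidate, since it removed the outliers that most hurt the downstream regression.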