Tuning n_neighbors
n_neighbors is the most crucial parameter of KNN. When you are unsure about the number of outliers in the dataset (which happens often), you can't use the rule of thumb that suggests using 20 neighbors when contamination is below 10%.
For such cases, you'll have to tune n_neighbors. Practice the process on the transformed version of the females dataset from the last exercise. It has been loaded as females_transformed. KNN estimator, evaluate_outlier_classifier and evaluate_regressor functions are also loaded.
Here are the function bodies as reminders:
def evaluate_outlier_classifier(model, data, threshold=.75):
model.fit(data)
probs = model.predict_proba(data)
inliers = data[probs[:, 1] <= threshold]
return inliers
def evaluate_regressor(inliers):
X, y = inliers.drop("weightkg", axis=1), inliers[['weightkg']]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, train_size=0.8)
lr = LinearRegression()
lr.fit(X_train, y_train)
preds = lr.predict(X_test)
rmse = root_mean_squared_error(y_test, preds)
return round(rmse, 3)
Este ejercicio forma parte del curso
Anomaly Detection in Python
Instrucciones del ejercicio
- Create a list of possible values for
n_neighborsin that order: 5, 10, 20 - Instantiate a
KNNmodel, setting the value ofn_neighborsto the currentkin the loop. - Find the inliers using the
evaluate_outlier_classifierfunction. - Calculate RMSE with
evaluate_regressorand store the result intoscoreswithkas the key and RMSE as the value.
Ejercicio interactivo práctico
Prueba este ejercicio y completa el código de muestra.
# Create a list of values for n_neigbors
n_neighbors = [____, ____, ____]
scores = dict()
for k in n_neighbors:
# Instantiate KNN with the current k
knn = ____(____, n_jobs=-1)
# Find the inliers with the current KNN
inliers = ____(____, ____, .50)
# Calculate and store RMSE into scores
scores[____] = ____
print(scores)