Tuning n_neighbors
n_neighbors is the most crucial parameter of KNN. When you are unsure about the number of outliers in the dataset (which happens often), you can't use the rule of thumb that suggests using 20 neighbors when contamination is below 10%.

In such cases, you'll have to tune n_neighbors. Practice the process on the transformed version of the females dataset from the last exercise. It has been loaded as females_transformed. The KNN estimator and the evaluate_outlier_classifier and evaluate_regressor functions are also loaded.
Here are the function bodies as reminders:
def evaluate_outlier_classifier(model, data, threshold=.75):
    model.fit(data)
    probs = model.predict_proba(data)
    inliers = data[probs[:, 1] <= threshold]
    return inliers
def evaluate_regressor(inliers):
    X, y = inliers.drop("weightkg", axis=1), inliers[['weightkg']]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, train_size=0.8)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    preds = lr.predict(X_test)
    rmse = mean_squared_error(y_test, preds, squared=False)
    return round(rmse, 3)
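To see this helper in action, here is a self-contained sketch of evaluate_regressor run on made-up data. The heightcm column and the coefficients below are invented for illustration, and the RMSE is computed via np.sqrt so the sketch also runs on recent scikit-learn versions, where the squared= keyword of mean_squared_error was removed:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def evaluate_regressor(inliers):
    # Separate features from the weightkg target
    X, y = inliers.drop("weightkg", axis=1), inliers[["weightkg"]]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=10, train_size=0.8
    )
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    preds = lr.predict(X_test)
    # sqrt of MSE, version-safe replacement for squared=False
    rmse = float(np.sqrt(mean_squared_error(y_test, preds)))
    return round(rmse, 3)

# Synthetic stand-in for the inliers of females_transformed
rng = np.random.default_rng(1)
height = rng.normal(165, 7, 500)
weight = 0.9 * height - 80 + rng.normal(0, 5, 500)
toy = pd.DataFrame({"heightcm": height, "weightkg": weight})

print(evaluate_regressor(toy))
```

Because the synthetic noise has a standard deviation of 5 kg, the reported RMSE should land near 5.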
This exercise is part of the course Anomaly Detection in Python.
Exercise instructions
- Create a list of possible values for n_neighbors in this order: 5, 10, 20.
- Instantiate a KNN model, setting the value of n_neighbors to the current k in the loop.
- Find the inliers using the evaluate_outlier_classifier function.
- Calculate RMSE with evaluate_regressor and store the result into scores with k as the key and RMSE as the value.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a list of values for n_neighbors
n_neighbors = [____, ____, ____]

scores = dict()
for k in n_neighbors:
    # Instantiate KNN with the current k
    knn = ____(____, n_jobs=-1)

    # Find the inliers with the current KNN
    inliers = ____(____, ____, .50)

    # Calculate and store RMSE into scores
    scores[____] = ____

print(scores)
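A filled-in version of the loop might look like the sketch below. Since pyod's KNN detector may not be installed everywhere, this sketch reproduces its idea with scikit-learn's NearestNeighbors: each point's outlier score is its distance to the k-th nearest neighbor, min-max scaled into [0, 1] to play the role of predict_proba. The helper signatures, the synthetic females_transformed data, and its heightcm column are stand-ins for the preloaded course objects:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def outlier_probs(data, k):
    """Distance to the k-th nearest neighbor, min-max scaled to [0, 1]."""
    nn = NearestNeighbors(n_neighbors=k + 1, n_jobs=-1).fit(data)
    dists, _ = nn.kneighbors(data)
    scores = dists[:, -1]  # column 0 is the self-distance, so take the last
    return (scores - scores.min()) / (scores.max() - scores.min())

def evaluate_outlier_classifier(data, k, threshold=0.75):
    # Keep rows whose scaled outlier score is at or below the threshold
    return data[outlier_probs(data, k) <= threshold]

def evaluate_regressor(inliers):
    X, y = inliers.drop("weightkg", axis=1), inliers[["weightkg"]]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=10, train_size=0.8
    )
    lr = LinearRegression().fit(X_train, y_train)
    preds = lr.predict(X_test)
    return round(float(np.sqrt(mean_squared_error(y_test, preds))), 3)

# Synthetic stand-in for females_transformed
rng = np.random.default_rng(0)
height = rng.normal(165, 7, 400)
weight = 0.9 * height - 80 + rng.normal(0, 5, 400)
females_transformed = pd.DataFrame({"heightcm": height, "weightkg": weight})

# Create a list of values for n_neighbors
n_neighbors = [5, 10, 20]

scores = dict()
for k in n_neighbors:
    # Find the inliers with the current k, then score the regression
    inliers = evaluate_outlier_classifier(females_transformed, k, 0.50)
    scores[k] = evaluate_regressor(inliers)

print(scores)
```

The k whose entry in scores has the lowest RMSE is the n_neighbors value that removed outliers most helpfully for the downstream regression.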