Tuning contamination

Finally, it is time to tune the notorious contamination parameter. The evaluate_outlier_classifier and evaluate_regressor functions from the video are already loaded for you. You can inspect them below.

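from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def evaluate_outlier_classifier(model, data):
    # Fit the detector and get labels (0 is assumed to mark inliers, as in PyOD)
    labels = model.fit_predict(data)

    # Return only the rows labeled as inliers
    return data[labels == 0]

def evaluate_regressor(inliers):
    # Separate the features from the target column
    X = inliers.drop("price", axis=1)
    y = inliers[['price']]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

    # Fit a linear regression on the training split
    lr = LinearRegression()
    lr.fit(X_train, y_train)

    # RMSE on the held-out split; taking the square root by hand avoids the
    # squared=False argument, which newer scikit-learn releases no longer accept
    preds = lr.predict(X_test)
    rmse = mean_squared_error(y_test, preds) ** 0.5

    return round(rmse, 3)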
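
Because evaluate_outlier_classifier keeps only the rows labeled 0, the share of rows labeled 1 decides how much data reaches the regressor, and contamination is the setting that controls that share. The sketch below illustrates the idea with a hypothetical helper, flag_outliers, that marks the highest-scoring fraction of points as outliers. It assumes higher scores mean "more anomalous" and is a simplified stand-in, not the thresholding code of any particular library.

```python
def flag_outliers(scores, contamination):
    """Label the top `contamination` fraction of scores as 1 (outlier), the rest as 0."""
    n_outliers = round(contamination * len(scores))

    # Indices ordered from most to least anomalous
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])

    return [1 if i in flagged else 0 for i in range(len(scores))]


# Made-up anomaly scores for 100 points, for illustration only
demo_scores = [i / 100 for i in range(100)]

for c in [0.07, 0.1, 0.15, 0.25]:
    labels = flag_outliers(demo_scores, c)
    print(f"contamination={c}: {sum(labels)} flagged, {labels.count(0)} kept")
```

A larger contamination value discards more rows before the regression is fit, which is why the choice can move the RMSE in either direction.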

You will be using a sample of the US Airbnb Listings dataset, which has already been loaded as airbnb_df.

Create a list called contaminations containing the four values 0.07, 0.1, 0.15, and 0.25, and create an empty dictionary called scores to store the RMSE scores.
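
One way the list and the dictionary might fit together is sketched below. The helper tune_contamination and its parameter names are hypothetical, not part of the exercise. It takes the detector factory and the two evaluation functions as arguments, so it does not assume a particular outlier detector.

```python
def tune_contamination(make_model, data, contaminations, filter_inliers, score_inliers):
    """Return a dict mapping each contamination value to its score."""
    scores = {}
    for c in contaminations:
        # Build a fresh, unfitted detector for this contamination value
        model = make_model(c)

        # Drop the rows the detector flags, then score a model on what remains
        inliers = filter_inliers(model, data)
        scores[c] = score_inliers(inliers)
    return scores


contaminations = [0.07, 0.1, 0.15, 0.25]
```

In this exercise the call would presumably look like tune_contamination(lambda c: IForest(contamination=c), airbnb_df, contaminations, evaluate_outlier_classifier, evaluate_regressor), assuming the detector is PyOD's IForest, which the text above does not state. The value with the lowest RMSE can then be read off with min(scores, key=scores.get).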
