LoslegenKostenlos loslegen

Detecting outliers

In the next exercises, you're going to use the K-means algorithm to predict fraud, and compare those predictions to the actual labels that are saved, to sense check our results.

The fraudulent transactions are typically flagged as the observations that are furthest aways from the cluster centroid. You'll learn how to do this and how to determine the cut-off in this exercise. In the next one, you'll check the results.

Available are the scaled observations X_scaled, as well as the labels stored under the variable y.

Diese Übung ist Teil des Kurses

Fraud Detection in Python

Kurs anzeigen

Anleitung zur Übung

  • Split the scaled data and labels y into a train and test set.
  • Define the MiniBatch K-means model with 3 clusters, and fit to the training data.
  • Get the cluster predictions from your test data and obtain the cluster centroids.
  • Define the boundary between fraud and non fraud to be at 95% of distance distribution and higher.

Interaktive Übung

Versuche dich an dieser Übung, indem du diesen Beispielcode vervollständigst.

# Split the data into training and test set
X_train, X_test, y_train, y_test = ____(____, ____, test_size=0.3, random_state=0)

# Define K-means model 
kmeans = ____(n_clusters=____, random_state=42).fit(____)

# Obtain predictions and calculate distance from cluster centroid
X_test_clusters = ____.____(X_test)
X_test_clusters_centers = ____.____
dist = [np.linalg.norm(x-y) for x, y in zip(X_test, X_test_clusters_centers[X_test_clusters])]

# Create fraud predictions based on outliers on clusters 
km_y_pred = np.array(dist)
km_y_pred[dist >= np.percentile(dist, ____)] = 1
km_y_pred[dist < np.percentile(dist, ____)] = 0
Code bearbeiten und ausführen