Detecting outliers

In the next exercises, you'll use the K-means algorithm to predict fraud and compare those predictions with the saved actual labels to sense-check the results.

Fraudulent transactions are typically flagged as the observations that lie furthest away from their cluster centroid. In this exercise, you'll learn how to compute those distances and how to determine the cut-off. In the next one, you'll check the results.
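
To make the distance-and-cut-off idea concrete, here is a minimal sketch using NumPy only. The data is randomly generated and purely hypothetical (it is not the course dataset), and a single centroid taken as the mean of the points stands in for a fitted cluster centre.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 points scattered around one centroid
points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
centroid = points.mean(axis=0)

# Euclidean distance of every point to the centroid
dist = np.linalg.norm(points - centroid, axis=1)

# Cut-off: the 95th percentile of the distance distribution
threshold = np.percentile(dist, 95)

# 1 = flagged as an outlier, 0 = treated as normal
flags = (dist >= threshold).astype(int)
print(flags.sum(), "of", len(flags), "points flagged")
```

The percentile is a tuning choice: a lower value flags more observations and a higher value flags fewer.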

The scaled observations are available as X_scaled, and the labels are stored in the variable y.

This exercise is part of the course

Fraud Detection in Python

Exercise instructions

  • Split the scaled data X_scaled and the labels y into a training set and a test set.
  • Define the MiniBatch K-means model with 3 clusters, and fit it to the training data.
  • Get the cluster predictions for your test data and obtain the cluster centroids.
  • Define the boundary between fraud and non-fraud at the 95th percentile of the distance distribution: observations at or above it are flagged as fraud.
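
The steps above can be sketched end to end as follows. This is one possible way to carry them out, not necessarily the course's reference solution. It assumes scikit-learn's `train_test_split` and `MiniBatchKMeans`, and it uses randomly generated stand-ins for X_scaled and y, since the course data is not available here. The explicit `n_init=3` is an added assumption that keeps behaviour consistent across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X_scaled and y (not the course data)
rng = np.random.default_rng(0)
X_scaled = rng.random((1000, 4))
y = (rng.random(1000) < 0.02).astype(int)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=0
)

# Fit MiniBatch K-means with 3 clusters on the training data
kmeans = MiniBatchKMeans(n_clusters=3, random_state=42, n_init=3).fit(X_train)

# Cluster assignment for each test point, and the fitted centroids
X_test_clusters = kmeans.predict(X_test)
centers = kmeans.cluster_centers_

# Distance from each test point to the centroid of its assigned cluster
dist = np.linalg.norm(X_test - centers[X_test_clusters], axis=1)

# Flag observations at or above the 95th percentile of distances as fraud
km_y_pred = (dist >= np.percentile(dist, 95)).astype(int)
```

Indexing `centers` with `X_test_clusters` pairs every test point with its own centroid, so the distances can be computed in one vectorised call instead of a Python loop.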

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Split the data into training and test set
X_train, X_test, y_train, y_test = ____(____, ____, test_size=0.3, random_state=0)

# Define the MiniBatch K-means model and fit it to the training data
kmeans = ____(n_clusters=____, random_state=42).fit(____)

# Obtain predictions and calculate distance from cluster centroid
X_test_clusters = ____.____(X_test)
X_test_clusters_centers = ____.____
dist = [np.linalg.norm(x-y) for x, y in zip(X_test, X_test_clusters_centers[X_test_clusters])]

# Create fraud predictions based on outliers on clusters 
km_y_pred = np.array(dist)
km_y_pred[dist >= np.percentile(dist, ____)] = 1
km_y_pred[dist < np.percentile(dist, ____)] = 0