Exercise

Detecting outliers

In the next exercises, you'll use the K-means algorithm to predict fraud and compare those predictions to the actual saved labels, to sense-check the results.

Fraudulent transactions are typically flagged as the observations that are furthest away from their cluster centroid. You'll learn how to do this and how to determine the cut-off in this exercise. In the next one, you'll check the results.

Available are the scaled observations X_scaled, as well as the labels stored under the variable y.

Instructions

  • Split the scaled data X_scaled and the labels y into a training set and a test set.
  • Define the MiniBatch K-means model with 3 clusters, and fit it to the training data.
  • Get the cluster predictions for your test data and obtain the cluster centroids.
  • Define the boundary between fraud and non-fraud at the 95th percentile of the distance distribution, so that observations at or above it are flagged as fraud.
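
The steps above could be sketched roughly as follows. This is a minimal illustration, not the exercise's reference solution: the random data standing in for X_scaled and y, the test_size and random_state values, and the variable names (dist, threshold, km_y_pred) are all assumptions made so the snippet runs on its own.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): in the exercise, X_scaled and y are already loaded.
rng = np.random.default_rng(0)
X_scaled = rng.random((1000, 5))
y = (rng.random(1000) < 0.05).astype(int)

# Split the scaled data and labels; test_size and random_state are illustrative.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=0
)

# Define the MiniBatch K-means model with 3 clusters and fit it to the training data.
kmeans = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0).fit(X_train)

# Cluster predictions for the test data, and the cluster centroids.
X_test_clusters = kmeans.predict(X_test)
centroids = kmeans.cluster_centers_

# Euclidean distance from each test observation to its own cluster centroid.
dist = np.linalg.norm(X_test - centroids[X_test_clusters], axis=1)

# Observations at or above the 95th percentile of distances are flagged as fraud (1).
threshold = np.percentile(dist, 95)
km_y_pred = (dist >= threshold).astype(int)
```

By construction, about 5% of the test observations end up flagged, whatever the data look like, so the 95% cut-off is a choice to tune rather than something the model learns. Comparing km_y_pred with y_test is what shows whether that choice is sensible.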