Detecting outliers
In the next exercises, you're going to use the K-means algorithm to predict fraud and compare those predictions to the actual saved labels to sanity-check the results.
Fraudulent transactions are typically flagged as the observations that are furthest away from their cluster centroid. In this exercise, you'll learn how to do this and how to determine the cut-off. In the next one, you'll check the results.
The scaled observations are available as X_scaled, and the labels are stored in the variable y.
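To make the idea of a distance cut-off concrete, here is a minimal sketch on made-up toy data (not the course data). It assumes a single cluster whose centroid is simply the mean of all points; the names points, centroid, threshold and flagged are illustrative only.

```python
import numpy as np

# Toy data (made up for illustration): 99 points near the origin, one far away.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 1.0, size=(99, 2)), [[10.0, 10.0]]])

# Assumption: one cluster only, so the centroid is the mean of all points.
centroid = points.mean(axis=0)

# Euclidean distance of every point to the centroid.
dist = np.linalg.norm(points - centroid, axis=1)

# Flag everything at or above the 95th percentile of the distances.
threshold = np.percentile(dist, 95)
flagged = (dist >= threshold).astype(int)
```

A percentile cut-off always flags a fixed share of the observations (here roughly the top 5%), whether or not they are truly anomalous, which is why the predictions still need to be checked against the labels.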
This exercise is part of the course Fraud Detection in Python.
Exercise instructions
- Split the scaled data and the labels y into a train and test set.
- Define the MiniBatch K-means model with 3 clusters, and fit to the training data.
- Get the cluster predictions from your test data and obtain the cluster centroids.
- Define the boundary between fraud and non-fraud at the 95th percentile of the distance distribution: observations at or above it are flagged as fraud.
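The steps above can be sketched end to end on synthetic data. This is one possible way to do it with scikit-learn, not necessarily the course's reference solution: X_demo and y_demo are hypothetical stand-ins for X_scaled and y, generated with make_blobs, and the distance is computed in vectorized form rather than with a list comprehension.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X_scaled and y (assumption: not the course data).
X_demo, _ = make_blobs(n_samples=1000, centers=3, n_features=4, random_state=0)
y_demo = np.zeros(len(X_demo), dtype=int)  # placeholder labels, unused by K-means

# Split data and labels into a train and test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit MiniBatch K-means with 3 clusters on the training data.
# n_init is set explicitly only to keep behavior stable across versions.
kmeans = MiniBatchKMeans(n_clusters=3, random_state=42, n_init=3).fit(X_train)

# Cluster assignment for each test point, and the fitted centroids.
clusters = kmeans.predict(X_test)
centers = kmeans.cluster_centers_

# Distance of each test point to the centroid of its assigned cluster.
dist = np.linalg.norm(X_test - centers[clusters], axis=1)

# 1 = at or above the 95th percentile of distances, 0 = below it.
y_pred = (dist >= np.percentile(dist, 95)).astype(int)
```

Indexing centers with clusters picks, for every test row, the centroid of the cluster that row was assigned to, so the subtraction lines up row by row.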
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Split the data into training and test set
X_train, X_test, y_train, y_test = ____(____, ____, test_size=0.3, random_state=0)
# Define K-means model
kmeans = ____(n_clusters=____, random_state=42).fit(____)
# Obtain predictions and calculate distance from cluster centroid
X_test_clusters = ____.____(X_test)
X_test_clusters_centers = ____.____
dist = [np.linalg.norm(x - center) for x, center in zip(X_test, X_test_clusters_centers[X_test_clusters])]
# Create fraud predictions based on outliers on clusters
km_y_pred = np.array(dist)
km_y_pred[dist >= np.percentile(dist, ____)] = 1
km_y_pred[dist < np.percentile(dist, ____)] = 0