Detecting outliers

In the next exercises, you'll use the K-means algorithm to predict fraud and compare those predictions with the actual saved labels to sense-check the results.

Fraudulent transactions are typically flagged as the observations that are furthest away from their cluster centroid. In this exercise, you'll learn how to do this and how to determine the cut-off. In the next one, you'll check the results.
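
To make the idea of "distance from the cluster centroid" concrete, here is a minimal sketch on toy data. The observations, centroids, and cluster assignments below are hypothetical values chosen for illustration, not output from the exercise's model.

```python
import numpy as np

# Hypothetical toy data: four 2-D observations.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [13.0, 14.0]])

# Hypothetical centroids and cluster assignments, standing in for
# what a fitted clustering model would return.
centers = np.array([[0.5, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 1, 1])

# centers[labels] picks, for every observation, the centroid of its own cluster.
assigned_centers = centers[labels]

# Euclidean distance from each observation to its assigned centroid.
dist = np.linalg.norm(X - assigned_centers, axis=1)
print(dist)
```

In this toy example the last observation lies 5 units from its centroid, much further than the others, so it would be the natural outlier candidate.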

The scaled observations are available as X_scaled, and the labels are stored in the variable y.

This exercise is part of the course

Fraud Detection in Python


Exercise instructions

Split the scaled data X_scaled and the labels y into a training and a test set.
Define the MiniBatch K-means model with 3 clusters, and fit to the training data.
Get the cluster predictions from your test data and obtain the cluster centroids.
Define the boundary between fraud and non-fraud at the 95th percentile of the distance distribution: observations at or above it are flagged as fraud.
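
The percentile cut-off in the last step can be sketched in isolation as follows. The distances here are made-up values (1 to 100) rather than distances from the exercise data, and the variable names threshold and flags are illustrative only.

```python
import numpy as np

# Hypothetical distances to the assigned centroid: 1.0, 2.0, ..., 100.0.
dist = np.arange(1.0, 101.0)

# Cut-off at the 95th percentile of the distance distribution.
threshold = np.percentile(dist, 95)

# 1 = flagged as potential fraud, 0 = not flagged.
flags = (dist >= threshold).astype(int)
print(threshold, flags.sum())
```

With 100 evenly spaced distances, the five largest values end up flagged, which matches the intent of treating the top 5% of distances as suspicious.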

Hands-on interactive exercise

Try this exercise by completing the sample code.

import numpy as np

# Split the data into training and test sets
X_train, X_test, y_train, y_test = ____(____, ____, test_size=0.3, random_state=0)

# Define K-means model 
kmeans = ____(n_clusters=____, random_state=42).fit(____)

# Obtain predictions and calculate distance from cluster centroid
X_test_clusters = ____.____(X_test)
X_test_clusters_centers = ____.____
dist = [np.linalg.norm(x-y) for x, y in zip(X_test, X_test_clusters_centers[X_test_clusters])]

# Create fraud predictions based on outliers on clusters 
km_y_pred = np.array(dist)
km_y_pred[dist >= np.percentile(dist, ____)] = 1
km_y_pred[dist < np.percentile(dist, ____)] = 0
