Get startedGet started for free

DBSCAN

In this exercise you're going to explore using a density based clustering method (DBSCAN) to detect fraud. The advantage of DBSCAN is that you do not need to define the number of clusters beforehand. Also, DBSCAN can handle weirdly shaped data (i.e. non-convex) much better than K-means can. This time, you are not going to take the outliers of the clusters and use that for fraud, but take the smallest clusters in the data and label those as fraud. You again have the scaled dataset, i.e. X_scaled available. Let's give it a try!

This exercise is part of the course

Fraud Detection in Python

View Course

Exercise instructions

  • Import DBSCAN.
  • Initialize a DBSCAN model setting the maximum distance between two samples to 0.9 and the minimum observations in the clusters to 10, and fit the model to the scaled data.
  • Obtain the predicted labels, these are the cluster numbers assigned to an observation.
  • Print the number of clusters and the rest of the performance metrics.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import DBSCAN
from sklearn.cluster import ____

# Initialize and fit the DBSCAN model
db = DBSCAN(eps=____, min_samples=____, n_jobs=-1).fit(____)

# Obtain the predicted labels and calculate number of clusters
pred_labels = ____.____
n_clusters = len(set(pred_labels)) - (1 if -1 in labels else 0)

# Print performance metrics for DBSCAN
print('Estimated number of clusters: %d' % ____)
print("Homogeneity: %0.3f" % homogeneity_score(labels, pred_labels))
print("Silhouette Coefficient: %0.3f" % silhouette_score(X_scaled, pred_labels))
Edit and Run Code