1. Other clustering fraud detection methods
Apart from K-means clustering, there are many other clustering methods you can use for fraud detection.
2. There are many different clustering methods
Each clustering method has its pros and cons. K-means works well when your data forms compact, roughly round clusters. As you can see in this picture, when the clusters have very different shapes, it does not perform so well. In the same picture, you can see that DBSCAN handles those shapes quite well.
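To make that comparison concrete, here is a minimal sketch. It assumes scikit-learn is installed and uses its synthetic make_moons data instead of the data in the picture; the eps and min_samples values are illustrative choices, not recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: clusters that are clearly not round.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# K-means has to be told the number of clusters up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN only needs a neighborhood radius and a minimum neighborhood size
# (both values here are assumptions chosen for this toy data).
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means the clustering matches the true moons.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-means ARI: {ari_kmeans:.2f}, DBSCAN ARI: {ari_dbscan:.2f}")
```

On this kind of data you would expect DBSCAN to follow the moon shapes, while K-means tends to cut straight through them.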
3. And different ways of flagging fraud: using smallest clusters
Apart from other clustering methods, there are also other ways to flag fraud, not just based on cluster outliers. Rather than treating fraud as the oddball outlier in the existing clusters, you can also use the smallest clusters as an indication of fraud, as pictured here. You can use this approach when fraudulent behavior has commonalities, and thus will cluster together in your data. In that sense, you would expect it to cluster in tiny groups, rather than be the outliers in the larger clusters. We'll explore this more in the exercises.
4. In reality it looks more like this
The previous image was a perfect-world example, but in reality, you will likely be looking at data that looks more like this. In this case, you see three obvious clusters, and a few dots that are clearly separate from the rest. As you can see, those isolated dots are outliers and fall outside of what you would describe as normal behavior. However, there are also medium to small clusters closely connected to the red cluster, so it's not very straightforward. That is why, if you can visualize your data with, for example, PCA, it can be quite helpful to do so.
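A rough sketch of such a PCA projection could look like the following. It assumes scikit-learn and NumPy, and the feature matrix X here is random placeholder data standing in for your own scaled features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 500 observations, 6 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))

# Scale first so that no single feature dominates the components.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components for a 2D scatter plot.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Share of the total variance that the 2D picture retains.
explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape, round(float(explained), 2))
```

You could then plot the two columns of X_2d against each other, colored by cluster label, to see how well separated the clusters really are.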
5. DBSCAN versus K-means
So let's talk a bit more about DBSCAN. DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. One benefit is that you do not need to predefine the number of clusters. The algorithm finds core samples of high density and expands clusters from them. This works well on data that contains clusters of similar density. This is a type of algorithm you can use to identify fraud as very small clusters. The things you do need to set in the DBSCAN model are the maximum distance at which two data points still count as neighbors, and the minimum number of data points required to form a cluster. As you already saw before, DBSCAN performs well on irregularly shaped data, but is computationally much heavier than, for example, mini-batch K-means.
6. Implementing DBSCAN
Implementing DBSCAN is relatively straightforward. You start by defining the epsilon, eps. This is the maximum distance between two data points for them to count as neighbors, and it determines how far a cluster can expand from its core samples. You also need to define the minimum number of samples in a cluster, min_samples. Conventional DBSCAN cannot determine the optimal value of epsilon for you; doing that automatically requires sophisticated modifications of DBSCAN, which are beyond the scope of this course.
You need to fit DBSCAN to your scaled data. You can use the labels_ attribute of the fitted model to get the assigned cluster label for each data point. Points that DBSCAN treats as noise get the label -1. You can also count the number of clusters by counting the unique cluster labels in the predictions. I use the length of the set of predicted labels here to do so, leaving out the noise label, but you can do this in different ways.
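The steps just described can be sketched as follows. This is a minimal example on synthetic data, assuming scikit-learn and NumPy; the three blobs, the eps value, and the min_samples value are made up for illustration and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three tight, well-separated groups of 100 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# DBSCAN is distance based, so scale the features first.
X_scaled = StandardScaler().fit_transform(X)

# eps and min_samples are illustrative values, not recommendations.
db = DBSCAN(eps=0.5, min_samples=10).fit(X_scaled)

# One label per data point; -1 marks points treated as noise.
pred_labels = db.labels_

# Number of clusters = number of unique labels, not counting noise.
n_clusters = len(set(pred_labels)) - (1 if -1 in pred_labels else 0)
print("Estimated number of clusters:", n_clusters)
```

Subtracting one when -1 is present keeps the noise label from being counted as a cluster of its own.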
7. Checking the size of the clusters
You can also compute performance metrics for the DBSCAN model, such as the average silhouette score. Suppose you want to calculate the size of each cluster. You can use NumPy's bincount function for this. Bincount counts the number of occurrences of each value in a NumPy array, but only works on non-negative integers, so you need to filter out the noise label -1 before counting.
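As a small illustration of counting cluster sizes with bincount, here is a sketch that only needs NumPy. The label array is made up by hand; in practice it would come from the fitted DBSCAN model.

```python
import numpy as np

# Hypothetical cluster labels, with -1 marking noise points.
pred_labels = np.array([0, 0, 0, 1, 1, -1, 2, 0, -1])

# bincount rejects negative values, so drop the noise label first.
cluster_labels = pred_labels[pred_labels >= 0]

# counts[i] is the number of points assigned to cluster i.
counts = np.bincount(cluster_labels)
print(counts)
```

The index into counts is the cluster label itself, which makes it easy to look up the size of any given cluster.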
From here, you can sort on size and decide how many of the smaller clusters you want to flag as fraud. This last bit is trial and error, and will also depend on how many fraud cases the fraud team can deal with on a regular basis.
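One possible way to do that sorting and flagging is sketched below, using only NumPy. The labels are made up, and n_flag, the number of small clusters to flag, is a hypothetical setting you would tune together with the fraud team.

```python
import numpy as np

# Hypothetical labels: four clusters of sizes 50, 5, 30 and 3, plus 2 noise points.
pred_labels = np.repeat([0, 1, 2, 3, -1], [50, 5, 30, 3, 2])

# Cluster sizes, ignoring the noise label -1.
counts = np.bincount(pred_labels[pred_labels >= 0])

# Labels of the n_flag smallest clusters (n_flag is an assumption).
n_flag = 2
smallest = np.argsort(counts)[:n_flag]

# Flag every point that belongs to one of those small clusters.
# Noise points are left unflagged here; you may prefer to review them too.
is_flagged = np.isin(pred_labels, smallest)
print("Flagged clusters:", smallest, "flagged points:", int(is_flagged.sum()))
```

Raising or lowering n_flag changes how many cases get sent for review, which is exactly the trade-off with the fraud team's capacity.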
8. Let's practice!
So, let's practice!