1. Clustering methods to detect fraud
Let's do a quick refresher on K-means clustering.
2. Clustering: trying to detect patterns in data
The objective of any clustering model is to detect patterns in your data. More specifically, to group your data into distinct clusters, each made up of data points that are very similar to each other, but distinct from the data points in the other clusters.
We can use this for fraud detection to determine which data looks very similar to the data in the clusters, and which data you would have a hard time assigning to any cluster. You can flag such data as odd, or suspicious. In this image you see a clear example where a cloud of data is clustered into three distinct clusters.
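To make this concrete, here is a minimal, hypothetical sketch of how such flagging could look with scikit-learn, assuming a fitted KMeans model and scaled data X_scaled (both are covered later in this lesson); the 95th-percentile cutoff is just an illustrative choice:

```python
import numpy as np

# Distance from each sample to its closest cluster centroid;
# kmeans.transform returns the distance to every centroid.
distances = np.min(kmeans.transform(X_scaled), axis=1)

# Flag samples that sit unusually far from every cluster as suspicious
# (the 95th percentile is just an example threshold).
threshold = np.percentile(distances, 95)
suspicious = distances > threshold
```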
3. K-means clustering: using the distance to cluster centroids
So, let's talk about how we achieve this pattern detection using K-means clustering. In this example, training samples are shown as dots and cluster centroids are shown as crosses. Let's say we try to cluster the data in image A.
4. K-means clustering: using the distance to cluster centroids
We start by putting in an initial guess for two cluster centroids in figure B. You therefore need to predefine the number of clusters at the start.
5. K-means clustering: using the distance to cluster centroids
You then calculate the distance from each sample to each centroid and assign each sample to its closest centroid, as shown in figure C, which allows you to split your data into the first two clusters.
6. Step 3
And based on these initial clusters, you can refine the location of the centroids to minimize the sum of all distances in the two clusters, as you can see here in picture D.
7. Step 4
You then repeat the step of reassigning points to their nearest centroid, as shown in figure E, and so forth,
8. Step 5
until it converges to the point where no sample gets reassigned to another cluster. The final clusters are depicted in picture F.
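To make the iteration concrete, here is a minimal NumPy sketch of the assign-and-update loop just described; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def kmeans_sketch(X, k=2, n_iter=100, seed=0):
    # Initial guess: pick k random samples as the starting centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Refine each centroid to the mean of the samples assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged: no centroid moved, so no sample gets reassigned
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```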
9. K-means clustering in Python
Let's see how to implement this in Python. You begin by importing the K-means model from scikit-learn, and also a scaling method. It is of utmost importance to scale your data before doing K-means clustering, or any algorithm that uses distances, for that matter. If you forget to scale, features on a larger scale will weigh more heavily in the algorithm, and you don't want that. All features should weigh equally at this point.
In the first step, you transform the data, stored under df, into a NumPy array and make sure all the data is of type float.
Second, you apply the MinMaxScaler and use fit_transform on the data, as this returns the scaled data.
Now you are ready to define the K-means model with 6 clusters, and fit it straight to the scaled data, as seen here. It is wise to fix the random state, to be able to compare models.
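Put together, the snippet described on this slide might look as follows; df is assumed to hold the feature data, and random_state=42 is an arbitrary choice:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Step 1: transform the DataFrame into a float NumPy array
X = np.asarray(df).astype(np.float64)

# Step 2: scale the data so all features weigh equally in the distance calculations
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: define the K-means model with 6 clusters, fix the random state,
# and fit it straight to the scaled data
kmeans = KMeans(n_clusters=6, random_state=42)
kmeans.fit(X_scaled)
```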
10. The right amount of clusters
The drawback of K-means clustering is that you need to assign the number of clusters beforehand. There are multiple ways to check what the right amount of clusters should be, such as the silhouette method or the elbow curve. Let's do a quick refresher on the elbow curve.
The objective of K-means is to minimize the sum of squared distances between the data samples and their associated cluster centroids. The score that scikit-learn reports is the negative of that sum, so you want the score to be as close to zero as possible. By running a K-means model on cluster counts varying from 1 to 10, like this, and saving the scores for each model under score, you can obtain the elbow curve. Then it is a matter of simply plotting the scores against the number of clusters, like this, which results in the following plot.
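A sketch of that loop, reusing the scaled data X_scaled from the previous slide (the variable name clust is illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit a K-means model for each number of clusters from 1 to 10 and save
# the score (the negative sum of squared distances) for each model
clust = range(1, 11)
score = [KMeans(n_clusters=k, random_state=42).fit(X_scaled).score(X_scaled)
         for k in clust]

# Plot the scores against the number of clusters to obtain the elbow curve
plt.plot(clust, score)
plt.xlabel('Number of clusters')
plt.ylabel('Score')
plt.title('Elbow curve')
plt.show()
```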
11. The elbow curve
This is an example of a typical elbow curve. The slight angle at K equals 3 suggests that 3 clusters could be optimal, although the elbow is not very pronounced in this case.
12. Let's practice!
Let's practice!