Assigning fraud versus non-fraud
1. Assigning fraud versus non-fraud
So how do you go from clustering your data to fraud detection? That's what you'll discover in this video.

2. Starting with clustered data
It all starts with your optimized clustering model; it can be k-means or any other clustering method, for that matter. In a nutshell, you're going to take the outliers of each cluster and flag those as fraud. In this example, you're looking at three clusters.

3. Assign the cluster centroids
As a first step, you need to collect and store the cluster centroids in memory, as they are the starting point for deciding what's normal and what's not.

4. Define distances from the cluster centroid
The next step is to calculate the distance of each point in the dataset to its own cluster centroid. In this case, I use the Euclidean distance, hence you see these depicted as round circles. You then also need to define a cut-off point for the distances, to decide what counts as an outlier. You do this based on the distribution of the distances collected. Suppose you decide that everything with a distance greater than the 95th percentile should be considered an outlier, i.e., you take the tail of the distribution of distances. In this case, that would mean that anything falling outside the circles is considered an outlier.

5. Flag fraud for those furthest away from cluster centroid
As you can see in the example here, this means that you are indeed mostly flagging the odd samples that lie very far away from the cluster centroids. These are definitely outliers and can thus be described as abnormal or suspicious. However, keep in mind that this doesn't necessarily mean these observations are also fraudulent. Compared to the majority of normal behavior, they are simply odd.

6. Flagging fraud based on distance to centroid
In Python, the steps are exactly the ones I've just described in pictures. It all starts with your trained clustering model, in this case k-means. You then need to assign each data point to a cluster with the predict function, and store those results. Next, you need to save the cluster centers. Then it's time to calculate the distance of each data point to its cluster centroid. As you can see, I use the norm function from NumPy's linear algebra package, which returns the vector norm, i.e., the distance from each data point to its assigned cluster centroid. Last, you use the percentiles of the distances to determine which samples are outliers. Here, I take the 93rd percentile using NumPy's percentile function, and flag a sample with a one if its distance is bigger than that. Those are the final fraud predictions.
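Put together, the code looks roughly like this. It's a minimal sketch: X_scaled stands in for your scaled feature matrix, and names such as km_y_pred are illustrative rather than the course's exact code.

```python
# A minimal sketch of the steps above; X_scaled is a placeholder for the
# real, scaled transaction features, and the variable names are illustrative.
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data standing in for the scaled feature matrix
X_scaled = np.random.RandomState(0).normal(size=(500, 2))

# Start from a trained k-means model
kmeans = KMeans(n_clusters=3, random_state=42).fit(X_scaled)

# Assign each data point to a cluster and store the results
X_clusters = kmeans.predict(X_scaled)

# Save the cluster centroids
centroids = kmeans.cluster_centers_

# Distance of each data point to its assigned cluster centroid (vector norm)
dist = np.array([np.linalg.norm(x - centroids[c])
                 for x, c in zip(X_scaled, X_clusters)])

# Flag everything beyond the 93rd percentile of distances as fraud (1)
km_y_pred = np.where(dist >= np.percentile(dist, 93), 1, 0)
```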
7. Validating your model results

Normally, this is where it gets difficult. If you don't have original fraud labels, you can't run the usual performance metrics, so you need some other way to sanity-check your results. The best way to do so is to collaborate closely with your fraud expert, and let them look at the predictions and investigate further. Second, you want to understand why these cases are outliers. Are they truly fraudulent, or just very rare cases of legitimate data in your sample? If they are rare but non-fraudulent cases, you can avoid flagging them by deleting certain features, or by removing those cases from the data altogether. If you do have some past cases of fraud, a good approach is to check whether your model can actually predict those when you test it on historic data. In the exercises, you'll use the original fraud labels to check our model performance, but do keep in mind that this is usually not possible.
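When such historic labels exist, that check can be as simple as comparing the flags against them. Below is a minimal sketch, assuming a label vector aligned with km_y_pred from the previous snippet; the random y_true here is only a placeholder for real historic labels.

```python
# A minimal sketch of the label-based sense check; y_true is a random
# placeholder standing in for real historic fraud labels.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

y_true = np.random.RandomState(1).binomial(1, 0.05, size=len(km_y_pred))

# How often do the flagged outliers coincide with known fraud cases?
print(confusion_matrix(y_true, km_y_pred))
print(classification_report(y_true, km_y_pred))
```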
8. Let's practice!

Let's practice!