1. Dealing with imbalanced datasets
Now you know how to create interesting features based on recency, frequency and social networks. Before applying a supervised method to your dataset, let's first look at some popular techniques for dealing with the class imbalance that fraud datasets typically have.
2. Imbalanced data sets
The key challenge in fraud detection is correctly classifying events as fraud or not, and class imbalance makes this hard for both classification methods and anomaly detection techniques. When the dataset is imbalanced, a classifier tends to favour the majority class, in the extreme case by labeling every case as legitimate. Overall accuracy then looks high, yet the classification error on the fraud cases is large. Classifiers tend to learn better from a balanced distribution.
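To make the point about the majority class concrete, here is a minimal Python sketch. The 980/20 split is a made-up toy example (2% fraud), and the "classifier" is simply a rule that labels every case as legitimate; nothing here comes from the course dataset.

```python
# Hypothetical toy labels: 980 legitimate (0) and 20 fraud (1) cases, i.e. 2% fraud.
labels = [0] * 980 + [1] * 20

# A "classifier" that ignores its input and always predicts the majority class.
predictions = [0] * len(labels)

# Overall accuracy: share of cases whose predicted label matches the true label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on fraud: share of the true fraud cases that were flagged as fraud.
fraud_total = sum(labels)
fraud_caught = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
fraud_recall = fraud_caught / fraud_total

print(f"accuracy:     {accuracy:.2%}")
print(f"fraud recall: {fraud_recall:.2%}")
```

The accuracy looks excellent while not a single fraud case is detected, which is why accuracy alone is a poor guide on imbalanced data.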
3. Imbalanced data sets
A possible solution, therefore, is to change the class distribution of your dataset with so-called sampling methods.
4. Original imbalance
Suppose the dataset has this class distribution.
5. Over-sampling minority class...
We can resolve this imbalance by increasing the number of fraud cases. This is called over-sampling the minority class.
6. ... or under-sampling majority class ...
We can also reduce the number of legitimate cases in our dataset, which is called under-sampling the majority class.
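As a rough illustration of under-sampling the majority class, the following Python sketch keeps every fraud case and a random subset of the legitimate ones. The helper `random_under_sample`, the `Class` key and the 95/5 toy data are all assumptions made for this example.

```python
import random


def random_under_sample(rows, label_key, majority_label, n_majority, seed=None):
    """Keep all minority rows plus a random subset of n_majority majority rows."""
    rng = random.Random(seed)
    majority = [r for r in rows if r[label_key] == majority_label]
    minority = [r for r in rows if r[label_key] != majority_label]
    # Sample WITHOUT replacement: each legitimate case is kept at most once.
    kept = rng.sample(majority, n_majority)
    result = kept + minority
    rng.shuffle(result)
    return result


# Hypothetical toy data: 95 legitimate (Class 0) and 5 fraud (Class 1) cases.
rows = [{"id": i, "Class": 0} for i in range(95)]
rows += [{"id": 95 + i, "Class": 1} for i in range(5)]

under = random_under_sample(rows, "Class", majority_label=0, n_majority=5, seed=1)
n_fraud = sum(r["Class"] for r in under)
print(len(under), "cases,", n_fraud, "of them fraud")
```

Note that under-sampling discards information: 90 of the 95 legitimate cases are thrown away in this toy run.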
7. ... or both!
Of course, we could also combine over-sampling and under-sampling.
8. Result after sampling...
A sampling method can be used to make the class distributions equal.
9. ... or like this
Or only roughly equal, because an exactly equal class distribution is not guaranteed to give the best detection model.
10. Random over-sampling (ROS)
Let's focus on random over-sampling first. Consider this small dataset of 10 cases.
11. Random over-sampling (ROS)
As we'll see in a later chapter, we'll split our dataset into a training part, on which we'll train a classification model, and a test part, on which we'll test the performance of the trained model.
12. Random over-sampling (ROS)
Sampling methods are applied only to the training set. Random over-sampling will increase the number of fraud cases in the training set,
13. Random over-sampling (ROS)
by copying randomly selected fraud cases multiple times to an over-sampled dataset. Again, keep in mind that for sampling methods, it is vital that you only sample the training set and not the test set.
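The order of operations described here (split first, then over-sample only the training part) can be sketched in Python as follows. The 10 cases, their ids, and the fixed split by id are invented for illustration; in practice the split would be random.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical dataset of 10 cases: 8 legitimate (Class 0) and 2 fraud (Class 1).
cases = [{"id": i, "Class": 1 if i in (3, 7) else 0} for i in range(10)]

# Split BEFORE sampling. A fixed split by id keeps the example easy to follow.
train = [c for c in cases if c["id"] < 6]   # ids 0..5, contains fraud case 3
test = [c for c in cases if c["id"] >= 6]   # ids 6..9, contains fraud case 7

# Random over-sampling: copy randomly chosen fraud cases from the TRAINING set
# until fraud and legitimate cases are equally frequent there.
train_fraud = [c for c in train if c["Class"] == 1]
train_legit = [c for c in train if c["Class"] == 0]
n_extra = len(train_legit) - len(train_fraud)
extra = [rng.choice(train_fraud) for _ in range(n_extra)]
oversampled_train = train + extra

print("training cases after over-sampling:", len(oversampled_train))
print("test cases (untouched):", len(test))
```

Because the copies are drawn only from `train`, no duplicate of a training case can leak into the test set and inflate the measured performance.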
14. Random over-sampling in practice
As an example, we take the following dataset, which can be found on Kaggle. The dataset contains about 300,000 transactions made with European credit cards during two days in September 2013. The features consist of 28 anonymised numerical variables, a time variable, the transferred amount, and the Class variable, which indicates whether the transaction was fraudulent or not.
15. A look at (a subset of) the dataset
We only consider a subset of this large dataset. This figure provides a quick look at the data by plotting variable V2 against variable V1 and showing the fraud cases in red.
16. Check the imbalance
Our dataset is clearly imbalanced: only 2% of its cases are fraud.
17. ovun.sample from ROSE package
Randomly over-sampling the fraud cases can be done with the function ovun.sample from the ROSE package. The original dataset contains 24108 legitimate cases. Suppose that we want the over-sampled dataset to contain 50% legitimate cases. Over-sampling leaves the number of legitimate cases unchanged, so these 24108 cases must make up 50% of the result. The total number of cases in the over-sampled dataset will therefore be 24108/0.50 = 48216. When using ovun.sample, we specify the target variable, which is Class, the dataset, the method, which is over-sampling, and the total number of cases in the over-sampled dataset. We also specify a seed so that we get the same result each time we execute the code.
The only thing left to do is to store the resulting dataset in an object, which we call oversampled_credit. We see that the classes in the over-sampled dataset are indeed balanced.
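The arithmetic behind the target size can be checked with a short Python sketch. This is not the ROSE implementation or its API; `over_sample_to_size` is a hypothetical helper, and the fraud count of 492 is an assumption chosen to match the stated 2% fraud rate, since the transcript only gives the number of legitimate cases.

```python
import random

n_legit = 24108   # legitimate cases (from the text)
p_legit = 0.50    # desired share of legitimate cases after over-sampling
n_fraud = 492     # ASSUMPTION: about 2% of the subset; not stated in the text

# The legitimate cases stay as they are, so they must equal p_legit of the total.
n_total = round(n_legit / p_legit)
n_fraud_target = n_total - n_legit


def over_sample_to_size(minority_rows, n_target, seed):
    """Hypothetical helper: keep the originals, add random copies up to n_target."""
    if n_target < len(minority_rows):
        raise ValueError("target is smaller than the number of minority cases")
    rng = random.Random(seed)  # seeding makes the result reproducible
    n_extra = n_target - len(minority_rows)
    return list(minority_rows) + rng.choices(minority_rows, k=n_extra)


fraud_ids = list(range(n_fraud))  # stand-ins for the actual fraud cases
oversampled_fraud = over_sample_to_size(fraud_ids, n_fraud_target, seed=2018)

print("total cases after over-sampling:", n_legit + len(oversampled_fraud))
print("share of fraud cases:", len(oversampled_fraud) / n_total)
```

Running the helper twice with the same seed gives identical copies, which is the role the seed plays in the description above.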
18. A look at the over-sampled dataset
This figure shows the over-sampled dataset where the size of each point corresponds to how often the case occurs in the data.
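The point sizes in such a figure come from counting how often each case occurs. A minimal Python sketch of that counting step, using made-up (V1, V2) pairs rather than the real data:

```python
from collections import Counter

# Hypothetical over-sampled cases as (V1, V2) pairs; repeated pairs are the
# copies created by random over-sampling.
oversampled = [
    (-1.2, 0.5), (0.3, 1.1), (-1.2, 0.5),
    (2.0, -0.7), (-1.2, 0.5), (0.3, 1.1),
]

# Count occurrences of each distinct case; a plot could use this as point size.
counts = Counter(oversampled)

for (v1, v2), n in counts.items():
    print(f"V1={v1:5.1f}  V2={v2:5.1f}  occurs {n} time(s)")
```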
19. Let's practice!
In the exercises, you will learn how to randomly over-sample fraud cases yourself.