Synthetic Over-sampling

1. Synthetic Over-sampling

Hi! My name is Sebastiaan Höppner, and I'm a PhD researcher at KU Leuven. In this lesson, we will talk about synthetic minority over-sampling.

2. Over-sampling with 'SMOTE'

You will learn how to use the over-sampling technique SMOTE to decrease the imbalance within your dataset. SMOTE stands for synthetic minority over-sampling technique. The method over-samples the fraud class by creating synthetic fraud cases.

3. Example: credit transfer data

We illustrate SMOTE on the following dataset, which contains 1000 transactions. For each transfer, we know the following details: whether it was fraudulent or not, the amount that was transferred, the balance on the account, and the ratio between the transferred amount and the balance. Notice that only 1 percent of the transactions in our dataset are fraudulent.

4. Look at the data (ratio vs amount)

Let's have a look at the data. Here we plot the ratio versus the amount of each transfer. The fraudulent ones are colored in red.

5. Focus on fraud cases

Now let's zoom in on the fraud cases.

6. SMOTE

...and select the fraud case indicated by X, which we call Tim.

7. SMOTE - step 1

In the first step, SMOTE finds the K nearest neighbors of Tim that are also fraudulent. K is a parameter that you can choose yourself. Let’s take K equal to 4 and indicate the four nearest fraudulent neighbors of Tim by X1, X2, X3 and X4.
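To make step 1 concrete, here is a minimal sketch in plain Python. It is not the code behind the "smotefamily" package. The fraud cases are made-up (amount, ratio) pairs, the helper name `k_nearest_fraud_neighbors` is hypothetical, and plain Euclidean distance on the raw attributes is an assumption.

```python
import math

# Hypothetical fraud cases as (amount, ratio) pairs; NOT the lesson's data.
fraud_cases = [
    (2000.0, 0.80),  # index 0 plays the role of "Tim"
    (2100.0, 0.85),
    (2500.0, 0.90),
    (3000.0, 0.95),
    (3500.0, 0.70),
    (9000.0, 0.99),
]

def k_nearest_fraud_neighbors(index, cases, k):
    """Return the indices of the k fraud cases closest to cases[index]."""
    distances = []
    for j, other in enumerate(cases):
        if j == index:
            continue  # a point is not its own neighbor
        # Assumption: Euclidean distance on the unscaled attributes.
        distances.append((math.dist(cases[index], other), j))
    distances.sort()
    return [j for _, j in distances[:k]]

tim = 0
neighbors = k_nearest_fraud_neighbors(tim, fraud_cases, k=4)
print(neighbors)
```

Note that with unscaled attributes the distance in this sketch is dominated by the amount, because the amounts are in the thousands while the ratios are below 1.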

8. SMOTE - step 2

In the second step, SMOTE randomly chooses one of these 4 fraudulent neighbors. Let’s say that point X4 is chosen and call it Bart.

9. SMOTE - step 3

In step 3, SMOTE will create a synthetic fraudulent sample. Here we have the attributes "amount" and "ratio" of both Tim and Bart.

10. SMOTE - step 3

First, SMOTE chooses a random number between 0 and 1, for example 0.6.

11. SMOTE - step 3

A synthetic fraud sample is then created as a linear combination of Tim and Bart. For this, SMOTE computes the difference between the attributes of Bart and Tim. Next, it multiplies these differences by 0.6 and adds the result to the attributes of Tim. The result is a synthetic transfer with an amount equal to 2782 and a ratio of 0.88.
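The interpolation in step 3 can be sketched as follows, assuming the usual SMOTE rule: synthetic = Tim + weight × (Bart − Tim). The attribute values for Tim and Bart below are made up for illustration, so the result differs from the 2782 and 0.88 on the slide, and `interpolate` is a hypothetical helper name.

```python
def interpolate(point, neighbor, weight):
    """Move from `point` toward `neighbor` by a fraction `weight` in [0, 1]."""
    return tuple(p + weight * (n - p) for p, n in zip(point, neighbor))

# Hypothetical (amount, ratio) values; NOT the lesson's actual Tim and Bart.
tim = (2000.0, 0.80)
bart = (3000.0, 0.90)

synthetic = interpolate(tim, bart, weight=0.6)
print(synthetic)  # roughly (2600, 0.86), up to floating-point rounding
```

Because the same weight is applied to every attribute, the synthetic point sits on the line segment between the two real fraud cases: a weight of 0 returns Tim and a weight of 1 returns Bart.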

12. SMOTE - step 3

This synthetic fraudulent transfer lies on the straight line between Tim and Bart. So, we successfully added a synthetic fraud case to our dataset.

13. SMOTE - step 4

Finally, SMOTE repeats the previous three steps a certain number of times for each real fraud case in the dataset. "dup_size" is a second parameter that you can choose. "dup_size" specifies how many times SMOTE should create a synthetic fraud sample for each real fraud case. Let’s choose "dup_size" equal to 10, so SMOTE adds 10 synthetic fraud cases for each real fraud case.
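Putting the four steps together, a minimal sketch of the whole procedure could look like this. It is plain Python for illustration only, not the "smotefamily" implementation. The function name `smote_sketch`, the `seed` argument, and the sample fraud cases are assumptions.

```python
import math
import random

def smote_sketch(fraud_cases, k, dup_size, seed=0):
    """Create dup_size synthetic cases per real fraud case (illustrative only)."""
    rng = random.Random(seed)
    synthetic = []
    for i, point in enumerate(fraud_cases):
        # Step 1: the k nearest fraudulent neighbors of this case.
        # They do not change between repetitions, so they are computed once.
        others = [c for j, c in enumerate(fraud_cases) if j != i]
        others.sort(key=lambda c: math.dist(point, c))
        neighbors = others[:k]
        for _ in range(dup_size):
            neighbor = rng.choice(neighbors)  # Step 2: pick one neighbor at random
            weight = rng.random()             # Step 3: random number in [0, 1)
            synthetic.append(
                tuple(p + weight * (n - p) for p, n in zip(point, neighbor))
            )
    return synthetic

# Hypothetical fraud cases as (amount, ratio) pairs; NOT the lesson's data.
fraud_cases = [
    (2000.0, 0.80), (2100.0, 0.85), (2500.0, 0.90),
    (3000.0, 0.95), (3500.0, 0.70), (9000.0, 0.99),
]

synthetic_cases = smote_sketch(fraud_cases, k=4, dup_size=10)
print(len(synthetic_cases))  # 6 real cases x dup_size 10 = 60 synthetic cases
```

The real fraud cases are left untouched in this sketch. The synthetic cases are returned separately, so they can be appended to the original data afterwards.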

14. SMOTE on `transfer_data`

You can use the function "SMOTE" in the R package "smotefamily". We have to specify four inputs: "X" is the dataset with the numeric attributes, "target" is the response variable, parameter "K" is the number of nearest neighbors, and parameter "dup_size" specifies how many synthetic fraud cases should be created for each real fraud case. You can access the new over-sampled dataset through the "data" element of the result, that is, by typing a dollar sign followed by "data". Our over-sampled dataset now contains 110 fraud cases, and the percentage of fraud is 10 percent rather than only 1 percent.
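The figures of 110 fraud cases and 10 percent follow from simple arithmetic, sketched here in Python. The count of 10 real fraud cases is derived from the 1 percent of 1000 transactions mentioned earlier. The variable names are only for illustration.

```python
n_total = 1000   # transactions in the original dataset
n_fraud = 10     # 1 percent of 1000
dup_size = 10    # synthetic cases created per real fraud case

n_synthetic = n_fraud * dup_size        # 100 synthetic fraud cases
n_fraud_after = n_fraud + n_synthetic   # 110 fraud cases in total
n_total_after = n_total + n_synthetic   # 1100 transactions in total

fraud_share_after = n_fraud_after / n_total_after
print(n_fraud_after, f"{fraud_share_after:.0%}")  # prints: 110 10%
```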

15. Synthetic fraud cases

Here you see the result of SMOTE. Due to the over-sampling of the fraudulent transfers, more fraudulent transactions are present in the data. Therefore, it will be easier for anomaly detection techniques and classification methods to recognize patterns in the data. This will help to detect other fraudulent transfers in the future.

16. Let's practice!

Now it's your turn to use SMOTE!