From dataset to detection model

1. From dataset to detection model

Now let's build a fraud detection model and use SMOTE to improve the performance of the model.

2. Roadmap

The first step is to divide the dataset into a training set and a test set. Next, we have to choose a machine learning model. Then we use SMOTE to balance the class distribution in the training set and train the model on the training set. Finally, we test the model's performance on the test set.

3. Divide dataset in training & test set

First, we split the dataset into two separate parts: a training set, on which the model is estimated, and a test set, which is used only to assess the model's performance. As an example, we've divided our credit transfer dataset into two equal parts. Notice how the training and test sets have the same class distribution.
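
As a rough sketch of this step in R, assuming the transactions sit in a data frame called transfers with a factor column fraud holding the class label (both names are illustrative), a stratified 50/50 split could look like this:

library(caret)

set.seed(42)                                       # make the split reproducible
in_train <- createDataPartition(transfers$fraud,   # stratified on the class label
                                p = 0.5, list = FALSE)
train_set <- transfers[in_train, ]
test_set  <- transfers[-in_train, ]

# Both parts should show (roughly) the same class distribution
prop.table(table(train_set$fraud))
prop.table(table(test_set$fraud))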

4. Choose & train machine learning model

There is a long list of machine learning models to choose from. As an example, we choose the CART algorithm to build a classification tree. This algorithm is implemented in the "rpart" package as the "rpart" function.
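
A minimal call, reusing the assumed train_set and fraud names from the split above, might look like this:

library(rpart)

# Grow a CART classification tree on the (still imbalanced) training set
tree_model <- rpart(fraud ~ ., data = train_set, method = "class")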

5. A simple classification tree model

Building a decision tree on the imbalanced training set results in this rather simple classification tree. At the top of the figure, you see the branches of the tree, which consist of splitting variables and splitting points. The leaves of the tree, also known as terminal nodes, are at the bottom of the tree. The two shades of grey in each node represent the fractions of fraud and legitimate cases in that node.
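
One possible way to draw such a figure (the slide does not say which plotting function was used) is the rpart.plot package:

library(rpart.plot)

rpart.plot(tree_model)   # splits at the top, leaf (terminal) nodes at the bottom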

6. Test performance on test set

We now use the trained model to predict the probability of fraud for each case in the test set. If the resulting score is larger than a threshold, say 50%, we classify the case as fraudulent; otherwise, we consider it legitimate. We use these predicted scores and classes to assess the accuracy of the model. For this, we use the "confusionMatrix" function in the "caret" package. The high accuracy, however, may be misleading. The Area Under the ROC curve (AUC) is a better performance metric, which can be calculated using the "pROC" package.
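
A sketch of this evaluation, assuming the class levels are named "legit" and "fraud" (adapt the names to your data):

library(caret)
library(pROC)

# Predicted probability of fraud for each case in the test set
scores <- predict(tree_model, newdata = test_set, type = "prob")[, "fraud"]

# Classify as fraud when the score exceeds the 50% threshold
pred_class <- factor(ifelse(scores > 0.5, "fraud", "legit"),
                     levels = levels(test_set$fraud))

# Accuracy (and more) from the confusion matrix
confusionMatrix(pred_class, test_set$fraud)

# Area Under the ROC curve as an alternative metric
auc(roc(test_set$fraud, scores))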

7. Apply SMOTE on training set

Instead of building the model on the imbalanced training set, we could improve the performance by first over-sampling the fraud cases using SMOTE. The re-balanced training set now contains almost 20% fraud cases.
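
The exact call depends on which SMOTE implementation you use; one option on CRAN is smotefamily::SMOTE, which expects numeric predictors and the class vector separately. The dup_size value below is purely illustrative and should be tuned so that the fraud share ends up near 20%:

library(smotefamily)

# Over-sample the fraud cases on the TRAINING set only, never on the test set
predictors <- train_set[, setdiff(names(train_set), "fraud")]
smote_out  <- SMOTE(X = predictors, target = train_set$fraud,
                    K = 5, dup_size = 10)

train_smote <- smote_out$data            # SMOTE stores the label in a column named "class"
prop.table(table(train_smote$class))     # check the new class distribution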

8. Train model on re-balanced training set

Building the decision tree on the re-balanced training set results in this model. This classification tree is more complex since it has more nodes and branches.
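
The call is the same as before, only the data changes; with the smotefamily output assumed above, the label column is named class:

# Grow the same kind of CART tree, now on the re-balanced training set
tree_smote <- rpart(class ~ ., data = train_smote, method = "class")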

9. Test performance of new model on test set

We test the performance of the new model on the same test set. Notice how the accuracy has decreased a little bit, while the AUC has increased significantly.
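
Scoring the same test set with both models makes the comparison explicit (again assuming the "fraud"/"legit" level names used above):

# Probability of fraud from the SMOTE-based model, on the unchanged test set
scores_smote <- predict(tree_smote, newdata = test_set, type = "prob")[, "fraud"]

auc(roc(test_set$fraud, scores))         # model trained on the imbalanced set
auc(roc(test_set$fraud, scores_smote))   # model trained on the SMOTE-balanced set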

10. Cost of deploying a detection model

Fraud is typically accompanied by financial losses. A model's performance can therefore be measured as the total cost of using the model. This total cost is based on costs associated with both misclassification errors and correct classifications.

11. Cost matrix

If a legitimate case is correctly identified as legitimate, then there are no costs involved.

12. Cost matrix

When a fraud case is misclassified as legitimate, then there are costs incurred like the stolen amount.

13. Cost matrix

When a legitimate case is wrongly identified as fraud, then the costs can be administration costs or costs for analyzing the case.

14. Cost matrix

Even when a fraud case is correctly detected, there are costs for analyzing and verifying the fraudulent nature of the transfer.
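
The four situations above can be written down as one small cost function per case; the 10-dollar analysis cost is only an example figure, and the stolen amount is assumed to be known per transfer:

# Cost of a single case, given its true class, the predicted class and the amount at stake
case_cost <- function(actual, predicted, amount, analysis_cost = 10) {
  if (actual == "legit" && predicted == "legit") return(0)       # correct legit: no cost
  if (actual == "fraud" && predicted == "legit") return(amount)  # missed fraud: the stolen amount
  analysis_cost                                                  # anything flagged as fraud: analyst cost
}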

15. Cost measure for a detection model

By taking the sum over all cases and their predicted classes, we can compute the total cost of a detection model. The fixed cost for analyzing a case might be 10 dollars, 50 dollars, or even more.
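
Summing the case-level costs over the test set gives the model's total cost; the amount column is an assumed name for the transferred amount, and pred_class is the vector of predicted classes from earlier:

# Total cost of a detection model on the test set
total_cost <- sum(mapply(case_cost,
                         actual    = as.character(test_set$fraud),
                         predicted = as.character(pred_class),
                         amount    = test_set$amount))
total_cost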

16. True cost of fraud detection

Without SMOTE, the total cost of the model on the test set is slightly over 10,000 dollars. After using SMOTE, the total cost of the model is only 7400 dollars, which means a decrease of 26% in losses.

17. Let's practice!

We illustrated the impact of the SMOTE sampling method on the performance of a model. Now it's your turn to build a detection model!
