Introduction & Motivation

1. Introduction & Motivation

Welcome to this course on Fraud Detection! You will have three instructors.

2. Instructors

I am Bart Baesens, a professor in Analytics and Data Science at the Faculty of Economics and Business of KU Leuven, and a lecturer at the University of Southampton (UK).

3. Instructors

Tim Verdonck is a professor in Statistics and Data Science at the Department of Mathematics of KU Leuven (Belgium). He is chairholder of the BNP Paribas Fortis Chair in Fraud Analytics and of the Allianz Chair Prescriptive Business Analytics in Insurance.

4. Instructors

Sebastiaan Höppner is currently a PhD researcher at the Section of Statistics and Data Science of the Department of Mathematics at KU Leuven.

5. What is fraud?

Fraud can be defined as an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types and forms.

6. Impact of fraud

Although fraud is rare, the cost of not detecting fraud can be huge. Here you can see some numbers to indicate the importance of the phenomenon. These examples show the need for organizations and governments to actively fight and prevent fraud.

7. Types of fraud

Here you can see some popular examples of fraud.

8. Key characteristics of successful fraud analytics models

For a fraud detection model to be successful, it needs to have the following characteristics. Statistical accuracy refers to the detection power and correctness of the model when cases are flagged as being suspicious.

9. Key characteristics of successful fraud analytics models

Interpretability refers to how well the model can be understood. In most settings, some level of understanding is required for the management to have confidence and allow the implementation of the model.

10. Key characteristics of successful fraud analytics models

A fraud analytical model should also comply with all applicable regulation and legislation, for example, with respect to privacy.

11. Key characteristics of successful fraud analytics models

The economic impact refers to the total cost of ownership and return on investment.

12. Key characteristics of successful fraud analytics models

Finally, classical expert-based fraud detection approaches are still in widespread use and definitely represent a good starting point and complementary tool to data-driven models. The choice between both should not be considered as man versus machine!

13. Challenges of fraud detection model

Various challenges arise when building analytical models for fraud detection. A major difficulty is the imbalance of the data: there are plenty of legitimate cases, but only very few fraudulent ones. For example, in credit card transactions, typically less than 0.5% of the transactions are fraudulent. This is commonly referred to as the needle-in-a-haystack problem, and it can make it difficult for an analytical technique to learn an accurate model.

14. Challenges of fraud detection model

Depending on the exact application, operational efficiency may be a key requirement. The fraud detection system might have only a limited amount of time available to reach a decision. In a credit card fraud detection setting, the decision time to let a transaction pass or not is typically less than eight seconds.

15. Challenges of fraud detection model

When you flag a good customer or their transaction as fraudulent, you risk losing that customer because of the hassle this causes.

16. Imbalanced data

Consider the following example. After a major storm, an insurance company received many claims. Fraud investigators discovered that some claims were fraudulent, and labeled these as one. Legitimate claims are labeled as zero. The percentage of fraud cases can be determined with the functions table and prop.table.
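As a minimal sketch of that step: the course data set is not part of this transcript, so the label vector below is an assumption, rebuilt from the counts quoted later in the claims example (614 legitimate and 14 fraudulent claims). The name fraud_label is hypothetical.

```r
# Assumption: labels rebuilt from the counts quoted in the transcript
# (614 legitimate claims, 14 fraudulent claims); not the real course data.
fraud_label <- c(rep(0, 614), rep(1, 14))

# Absolute number of claims per class (0 = legitimate, 1 = fraud)
counts <- table(fraud_label)
print(counts)

# Share of each class, expressed as a percentage
shares <- prop.table(counts)
print(round(100 * shares, 2))
```

With these counts, fraud makes up 14 of 628 claims, or about 2.23%.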

17. Visualize imbalance with pie chart

We can visualize the ratio between good claims and fraud claims in a pie chart. Except for networks in the second chapter, data visualizations are outside the scope of this course and will be provided for you. You may refer to other DataCamp courses to learn more about data visualization.
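For readers who want to reproduce such a chart themselves, here is one possible sketch using base R's pie function. It is not the plotting code used in the course. The class counts are taken from the claims example in this transcript, and the names class_counts and pie_labels, the colors and the title are illustrative choices.

```r
# Assumption: class counts taken from the claims example in the transcript.
class_counts <- c(legitimate = 614, fraud = 14)

# Build slice labels that show the class name and its percentage share
pie_labels <- sprintf("%s (%.2f%%)",
                      names(class_counts),
                      100 * class_counts / sum(class_counts))

# Draw the pie chart; colors and title are arbitrary illustrative choices
pie(class_counts,
    labels = pie_labels,
    col = c("steelblue", "tomato"),
    main = "Legitimate versus fraudulent claims")
```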

18. Confusion matrix

A confusion matrix indicates the number of true positives, true negatives, false positives and false negatives.

19. Confusion matrix: claims example

Suppose no detection model is used, so all filed claims are considered legitimate. This means that all predicted labels equal zero. Next, we can compute the corresponding confusion matrix using the function confusionMatrix from the caret package. Although all 614 legitimate claims are identified correctly, none of the 14 fraud cases is detected. Despite not using a detection model, the reported accuracy is still 97.77%. This high number is entirely due to the small number of fraud cases compared to the large number of legitimate claims. This illustrates that accuracy is not a good performance measure when the data is imbalanced.
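The arithmetic behind that accuracy figure can be checked with a short sketch. To stay free of package dependencies it builds the confusion matrix with base R's table rather than caret. The labels are again an assumption, rebuilt from the counts quoted above, and the variable names are hypothetical.

```r
# Assumption: true labels rebuilt from the counts quoted in the transcript.
labels <- factor(c(rep(0, 614), rep(1, 14)), levels = c(0, 1))

# "No detection model": every claim is predicted to be legitimate (0)
predictions <- factor(rep(0, length(labels)), levels = c(0, 1))

# Rows are predicted classes, columns are true classes.
# With caret installed, caret::confusionMatrix(data = predictions,
# reference = labels) reports the same table plus summary statistics.
cm <- table(Prediction = predictions, Reference = labels)
print(cm)

# Accuracy: share of claims on the diagonal (correctly classified)
accuracy <- sum(diag(cm)) / sum(cm)

# Recall for the fraud class: detected fraud cases / all fraud cases
fraud_recall <- cm["1", "1"] / sum(cm[, "1"])

print(round(100 * accuracy, 2))
print(fraud_recall)
```

Accuracy comes out at 614 / 628, about 97.77%, while the recall for the fraud class is 0: the number looks good only because the fraud class is so small.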

20. Total cost of not detecting fraud: claims example

We define the total cost due to fraud simply as the sum of all fraud amounts.
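A minimal sketch of this definition follows. The five claim amounts and labels are made up purely for illustration and do not come from the course data; the names claims and total_fraud_cost are hypothetical.

```r
# Assumption: toy data invented for illustration, not the course data set.
claims <- data.frame(
  amount      = c(1200, 450, 3000, 800, 2500),
  fraud_label = c(0, 1, 0, 0, 1)
)

# Total cost of undetected fraud: sum of the amounts of all fraudulent claims
total_fraud_cost <- sum(claims$amount[claims$fraud_label == 1])
print(total_fraud_cost)
```

Here the two fraudulent claims add up to 450 + 2500 = 2950.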

21. Let's practice!

Now let's try some examples.
