1. Introduction to fraud detection
Welcome to this DataCamp course on fraud detection. To benefit from this course, you should be comfortable manipulating dataframes and visualizing data, and you should have a foundation in supervised and unsupervised learning.
2. Meet your instructor
I'll be your teacher throughout this course, and my name is Charlotte.
Did you know that a typical organization loses 5 percent of its revenue to fraud each year? In fact, fraud is estimated to cost the UK economy 73 billion British pounds each year. Fraud, in other words, poses a serious problem to almost all companies. This course teaches you how to tackle fraud as a data scientist, and thereby make a tangible impact on your company.
3. What is fraud?
Fraudulent behavior can be found in many different areas. Credit card fraud is perhaps the most famous example, and fraud is also a well-known issue in the insurance industry. But it is much more widespread than that. For example, every e-commerce business needs to continuously assess whether client transactions on its website are legitimate.
Detecting fraud is typically challenging because of the four characteristics described here. First of all, fraud cases are a small minority: sometimes only one-hundredth of a percent of a company's transactions are fraudulent. Fraudsters will also try their best to "blend in" and conceal their activities. Moreover, fraudsters will find new methods to avoid getting caught, and change their behavior over time. Lastly, fraudsters often work together and organize their activities in a network, which makes the fraud harder to detect. A single fraud case can involve multiple client accounts. Let's illustrate this with an example.
4. Fraud detection is challenging
Have you ever played "Where's Waldo?" or "Find the odd one out"? As in those games, in fraud detection you'll need to train an algorithm to pick a well-concealed observation out of many normal observations. Can you find the odd one out here?
5. Fraud detection is challenging
Here it is. It looks like the other clovers, but it deviates slightly. That one was easy, but it does get much harder when we're working with numbers, so have a look at this one.
6. Fraud detection is challenging
This is much more like real life: we'll need to find a fraud case based on numbers. The case we're looking for is well concealed, and only one of these numbers is odd one out. Can you find it?
7. Fraud detection is challenging
Here it is, 26. It's the only number in this set that's not divisible by 4. This illustrates a typical fraud detection problem really well: based on data, you'll need to train an algorithm to find the odd one out among many normal observations.
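To make the idea concrete, here is a minimal sketch of the same puzzle in Python. The list of numbers is hypothetical (the numbers on the slide are not reproduced here, apart from 26), and the rule "divisible by 4" is hard-coded, whereas in real fraud detection the pattern has to be learned from the data.

```python
# Hypothetical set of numbers: all divisible by 4 except one.
# Only the value 26 comes from the example; the rest are made up.
numbers = [8, 12, 16, 20, 24, 26, 28, 32, 36, 40]


def odd_ones_out(values, divisor=4):
    """Return the values that break the 'divisible by divisor' pattern."""
    return [v for v in values if v % divisor != 0]


outliers = odd_ones_out(numbers)
print(outliers)
```

Here the rule is known in advance, which is exactly what makes the toy version easy. A fraud model has to discover what "normal" looks like before it can flag what deviates from it.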
8. How companies deal with fraud
As a data scientist working on fraud analytics, you'll often be asked to improve existing fraud detection systems. You may find that the company already uses a rules-based system to filter out strange cases, or that the fraud analytics team checks the news for suspicious names, or keeps track of external hit lists from the police to cross-check against the client base. All these existing methods can be useful for your machine learning model, as you can use them as inputs in your analysis. But do be mindful when using labels that come out of existing rules-based systems; you should always ask yourself whether the labels are reliable, as they might not catch all fraudulent cases.
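As a rough illustration of using existing checks as model inputs, the sketch below turns one rule and one hit list into binary feature columns with pandas. Everything in it is an assumption made for the example: the column names, the amount threshold, and the hit list contents do not come from the course.

```python
import pandas as pd

# Hypothetical transactions; names and amounts are invented for illustration.
transactions = pd.DataFrame({
    "client": ["A", "B", "C", "D"],
    "amount": [25.0, 4999.0, 80.0, 12000.0],
})

# Hypothetical outputs of existing systems: a simple amount rule
# and an external hit list of client names.
AMOUNT_THRESHOLD = 1000.0
hit_list = ["C"]

# Encode each existing check as a 0/1 column that a model can use as an input.
transactions["rule_large_amount"] = (
    transactions["amount"] > AMOUNT_THRESHOLD
).astype(int)
transactions["on_hit_list"] = transactions["client"].isin(hit_list).astype(int)

print(transactions)
```

Used this way, the rule outputs are features rather than labels, so a case the rules miss can still be picked up by the model from the other inputs.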
9. Let's have a look at some data
In this chapter, we'll explore a dataset of credit card transactions. We have 29 features available, plus a Class variable that indicates whether the transaction is fraudulent or not. We have data on 5050 transactions in total. That should be enough to train our first algorithm.
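A first look at a dataset like this usually starts with its shape and the balance of the Class variable. The sketch below does that on a synthetic stand-in with the same dimensions (5050 rows, 29 features, one Class column). The feature names, the random values, and the number of fraud cases (50) are assumptions for illustration, not properties of the course dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the course data. The dimensions match the description;
# the values, feature names, and n_fraud are assumptions.
rng = np.random.default_rng(0)
n_rows, n_features, n_fraud = 5050, 29, 50

features = pd.DataFrame(
    rng.normal(size=(n_rows, n_features)),
    columns=[f"feature_{i}" for i in range(1, n_features + 1)],
)
labels = np.zeros(n_rows, dtype=int)
labels[:n_fraud] = 1  # mark the assumed fraud cases
df = features.assign(Class=labels)

# Shape: rows, and columns (29 features plus Class).
print(df.shape)

# Class balance: counts per class and the share of fraud cases.
class_counts = df["Class"].value_counts()
fraud_ratio = df["Class"].mean()
print(class_counts)
print(f"Share of fraud cases: {fraud_ratio:.2%}")
```

With real data, the same two checks apply after loading the file into a dataframe; the share of fraud cases tells you how imbalanced the problem is before any model is trained.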
10. Let's practice!
Now let's have a look at this credit card data in more detail!