Review of classification methods

1. Review of classification methods

In this video, you'll get a short recap on classification methods.

2. What is classification?

Classification is the problem of identifying which class a new observation belongs to, on the basis of a training set of observations whose class is known. Classes are sometimes called targets, labels or categories. For example, spam detection by email service providers is a classification problem. It is a binary classification problem, since there are only two classes: spam and not spam. Fraud detection is also a classification problem, as we try to predict whether an observation is fraudulent, yes or no. Lastly, assigning a diagnosis to a patient based on the characteristics of a tumor, malignant or benign, is a classification problem. Classification problems normally have a categorical output, like yes or no, 1 or 0, True or False. In the case of fraud detection, the negative non-fraud class is the majority class, whereas the fraud cases are the minority class.
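To make the majority and minority class distinction concrete, here is a minimal sketch that counts class frequencies in a list of labels. The labels and their proportions are made up for illustration and are not taken from the course data.

```python
from collections import Counter

# Hypothetical labels: 0 = non-fraud, 1 = fraud (made-up proportions)
labels = [0] * 985 + [1] * 15

counts = Counter(labels)
ordered = counts.most_common()  # sorted from most to least frequent
majority_class, majority_count = ordered[0]
minority_class, minority_count = ordered[-1]
fraud_ratio = minority_count / len(labels)

print(f"majority class: {majority_class} ({majority_count} cases)")
print(f"minority class: {minority_class} ({minority_count} cases)")
print(f"share of fraud: {fraud_ratio:.1%}")
```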

3. Classification methods commonly used for fraud detection

Logistic regression is one of the most widely used machine learning algorithms for binary classification. It is a simple algorithm that you can use as a performance baseline. It is easy to implement and it will do well enough in many tasks. It can also be adjusted to work reasonably well on highly imbalanced data, which makes it quite useful for fraud detection. I expect you to be familiar with this model, so I won't go into further detail about precisely how it works.
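As a rough illustration of what adjusting logistic regression for imbalanced data can look like, the sketch below uses scikit-learn's class_weight option. The synthetic data set, the roughly 2% fraud share and all variable names are assumptions made for this example, not the course's own setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for fraud data: roughly 2% positive (fraud) cases
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.98, 0.02], random_state=0
)

# Stratify so that both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" weights each class inversely to its frequency,
# so mistakes on the rare fraud class count more during training
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

predicted = model.predict(X_test)
fraud_probability = model.predict_proba(X_test)[:, 1]
```

Whether balanced weights are the right adjustment depends on the data; they are shown here only as one commonly available option.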

4. Classification methods commonly used for fraud detection

Neural networks can also be used as classifiers for fraud detection. They are capable of fitting highly non-linear models to our data. They tend to be slightly more complex to implement than most of the other classifiers we discuss in this course, so you won't use this classifier in the exercises. Nonetheless, it's important to be aware that this model is suitable for fraud detection.

5. Classification methods commonly used for fraud detection

Decision trees and random forests are very commonly used in fraud detection. As you can see in the picture, decision trees give very transparent results that are easily interpreted by fraud analysts. Nonetheless, they are prone to overfitting your data. Random forests are, therefore, a more robust option: they construct a multitude of decision trees when training your model and output the class that is the mode, or the mean prediction, of all the individual trees.
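The idea of taking the mode of the individual trees' predictions can be sketched in a few lines of plain Python. The majority_vote helper and the five votes are hypothetical and only illustrate the voting step. Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, which corresponds to the "mean" variant.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class among the individual tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one transaction (1 = fraud)
votes = [0, 1, 1, 0, 1]
forest_prediction = majority_vote(votes)
print(forest_prediction)  # prints 1, since three of the five trees vote fraud
```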

6. Decision trees and random forests

To be more precise, a random forest consists of a collection of trees, each built on a random subset of the features. Final predictions are the combined results of those trees. Random forests can handle complex data and are less prone to overfitting than a single decision tree. They can be interpreted by looking at feature importance, and can be adjusted to work well on highly imbalanced data. The main drawback is that they can be computationally quite heavy to run. Nonetheless, random forests are very popular for fraud detection. Throughout the exercises, you'll work on optimizing a random forest on our credit card fraud data.
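Here is a hedged sketch of the properties just mentioned: random feature subsets, an adjustment for imbalanced data, and feature importance. The synthetic data and the specific settings (50 trees, max_features="sqrt", class_weight="balanced_subsample") are assumptions chosen for illustration, not the values used in the exercises.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for fraud data: roughly 3% positive (fraud) cases
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=3,
    weights=[0.97, 0.03], random_state=1,
)

model = RandomForestClassifier(
    n_estimators=50,                     # number of trees in the forest
    max_features="sqrt",                 # random feature subset at each split
    class_weight="balanced_subsample",   # reweight classes per bootstrap sample
    random_state=1,
)
model.fit(X, y)

# Impurity-based importances, one value per feature, summing to 1
importances = model.feature_importances_
ranking = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
for index, importance in ranking:
    print(f"feature {index}: {importance:.3f}")
```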

7. Random forests for fraud detection

To refresh your memory, here is how to implement a simple random forest model using scikit-learn. First, you need to import the model from scikit-learn's ensemble module. Then, let's define the model here. The next step is to fit the model to your training set. Once that's done, you can obtain model predictions by running it on the test set. And lastly, you can obtain a measure of performance, let's take accuracy, by comparing your predictions to the actual labels in y_test.
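The steps above can be sketched as follows. The slide's own code is not part of this transcript, so the synthetic data, the train/test split and the names other than y_test are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier  # step 1: import the model
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit card data (assumption for this sketch)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=5
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=5
)

# Step 2: define the model
model = RandomForestClassifier(n_estimators=20, random_state=5)

# Step 3: fit the model to the training set
model.fit(X_train, y_train)

# Step 4: obtain predictions on the test set
predicted = model.predict(X_test)

# Step 5: compare the predictions to the actual labels in y_test
accuracy = accuracy_score(y_test, predicted)
print(f"accuracy: {accuracy:.3f}")
```

Keep in mind that on highly imbalanced data a model can reach a high accuracy simply by predicting the majority class, so the number is best read with the class balance in mind.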

8. Let's practice!

Let's practice!
