Random under-sampling

1. Random under-sampling

Besides random over-sampling, we can also change the class distribution in a dataset with random under-sampling.

2. Random under-sampling (RUS)

Randomly under-sampling the legitimate cases decreases the percentage of legitimate cases in the dataset, and so raises the percentage of fraud cases.

3. Random under-sampling (RUS)

Here we have the same small dataset as in the previous video-exercise. After splitting the data into a training set and a test set, we'll lower the number of legitimate cases,

4. Random under-sampling (RUS)

by removing some of them from the training set at random.

5. Random under-sampling (RUS)

The class distribution in the resulting under-sampled training set will be more balanced.
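The steps on these slides can be sketched in base R. This is only an illustration on a made-up toy training set: the data frame `train`, the column name `Class`, and the target of 8 legitimate cases are assumptions, not values from the video.

```r
set.seed(1)  # make the random draw reproducible

# Hypothetical toy training set: 20 legitimate cases (0) and 4 fraud cases (1)
train <- data.frame(
  V1 = rnorm(24),
  V2 = rnorm(24),
  Class = factor(c(rep(0, 20), rep(1, 4)))
)

legit_idx <- which(train$Class == 0)
fraud_idx <- which(train$Class == 1)

# Keep every fraud case, but only a random subset of the legitimate cases
n_keep <- 8  # assumed number of legitimate cases to keep
keep_legit <- sample(legit_idx, size = n_keep)  # drawn without replacement
train_under <- train[c(keep_legit, fraud_idx), ]

# Class shares before and after under-sampling
print(prop.table(table(train$Class)))
print(prop.table(table(train_under$Class)))
```

In this toy example the share of fraud cases rises from 4/24 to 4/12. Only the training set is under-sampled; the test set is left as it is.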

6. A look at the imbalanced dataset

Remember the scatterplot of the original imbalanced dataset we showed in the previous video-exercise? Variable V2 is plotted against variable V1 and the fraudulent transfers are shown in red.

7. ovun.sample from ROSE package also for RUS

The ovun.sample function from the ROSE package can also be used for under-sampling. Under-sampling keeps every fraud case, so the size of the under-sampled dataset follows from the number of fraud cases in the original dataset, which is 492, divided by the fraction of fraud cases we want in the under-sampled dataset. Here we chose 50%, so the under-sampled dataset will contain 492/0.50 = 984 cases in total. The call is the same as for over-sampling, except that we now set the parameter "method" to "under". After storing the under-sampled dataset as "undersampled_credit", we see that it indeed has an equal class distribution.
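A minimal sketch of the call described on this slide. The original credit card data is not available here, so the code uses a synthetic stand-in, `credit_sim`, with the class counts quoted in the transcript. The names `credit_sim`, `new_frac_fraud`, `new_n_total`, the response column `Class`, and the seed value are assumptions for illustration.

```r
# Synthetic stand-in with the class counts quoted in the transcript:
# 24600 cases in total, of which 492 are fraud (Class == 1)
set.seed(2018)
n_total <- 24600
n_fraud <- 492
credit_sim <- data.frame(
  V1 = rnorm(n_total),
  V2 = rnorm(n_total),
  Class = factor(c(rep(0, n_total - n_fraud), rep(1, n_fraud)))
)

# Desired fraction of fraud cases after under-sampling
new_frac_fraud <- 0.50
# All fraud cases are kept, so the new total is n_fraud / fraction
new_n_total <- n_fraud / new_frac_fraud  # 492 / 0.50 = 984

# Run the ROSE call only if the package is installed
if (requireNamespace("ROSE", quietly = TRUE)) {
  result <- ROSE::ovun.sample(Class ~ ., data = credit_sim,
                              method = "under", N = new_n_total, seed = 2018)
  undersampled_credit <- result$data  # the sampled data frame
  print(table(undersampled_credit$Class))
}
```

The function returns a list, and the sampled data frame is its `data` element.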

8. A look at the under-sampled dataset

This figure shows the result of randomly under-sampling the majority group.

9. Let's do both!

Of course, we can combine both over- and under-sampling.

10. Combination of over- & under-sampling

All you need to do is specify how many cases you want in the final dataset and the fraction of fraud cases it should contain. In this example we choose the same number of cases as the original dataset, which is 24600, and a fraud fraction of 50%. By setting the parameter "method" to "both", the ovun.sample function will over-sample the fraud cases and under-sample the legitimate cases so that the sampled dataset contains 24600 cases, of which approximately 50% are fraud.
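The combined call might look like the sketch below. It again runs on a synthetic stand-in rather than the real data: `credit_sim`, `sampled_credit`, the response column `Class`, and the seed value are assumed names and values.

```r
# Synthetic stand-in with the class counts quoted in the transcript
set.seed(2018)
n_total <- 24600
n_fraud <- 492
credit_sim <- data.frame(
  V1 = rnorm(n_total),
  V2 = rnorm(n_total),
  Class = factor(c(rep(0, n_total - n_fraud), rep(1, n_fraud)))
)

n_new <- nrow(credit_sim)  # keep the original size: 24600 cases
fraud_fraction <- 0.50     # desired fraction of fraud cases

# Run the ROSE call only if the package is installed
if (requireNamespace("ROSE", quietly = TRUE)) {
  result <- ROSE::ovun.sample(Class ~ ., data = credit_sim,
                              method = "both", N = n_new,
                              p = fraud_fraction, seed = 2018)
  sampled_credit <- result$data
  print(table(sampled_credit$Class))
  print(prop.table(table(sampled_credit$Class)))
}
```

With "both", the fraud share comes out close to 50% but not exactly 50%, which is why the slide says "approximately".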

11. Result!

This is the result of combining both random over-sampling and random under-sampling.

12. Let's practice!

In the exercises, you will learn how to use the ovun.sample function for over-sampling, under-sampling, or a combination of both!