Natural hit rate
In this exercise, you'll again use credit card transaction data. The features and labels are similar to the data in the previous chapter, and the data is heavily imbalanced. We've already given you features X and labels y to work with, both of which are NumPy arrays.
First you need to explore how prevalent fraud is in the dataset, to understand what the "natural accuracy" is: the accuracy you would get by predicting every transaction as non-fraud. It's important to know which level of "accuracy" you need to "beat" in order to get a better prediction than by doing nothing. In the following exercises, you'll create your first random forest classifier for fraud detection. That will serve as the "baseline" model that you're going to try to improve in the upcoming exercises.
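To make the idea of natural accuracy concrete, here is a minimal sketch on made-up labels. The list toy_labels and its 95/5 split are illustrative assumptions, not the course data; the real y is far more imbalanced.

```python
# Hypothetical toy labels (an assumption for illustration, not the course data):
# 0 = non-fraud, 1 = fraud.
toy_labels = [0] * 95 + [1] * 5

total = len(toy_labels)

# Count the observations labelled as non-fraud.
non_fraud_count = sum(1 for label in toy_labels if label == 0)

# Accuracy, in percent, of a "model" that predicts non-fraud for every transaction.
natural_accuracy = 100 * non_fraud_count / total

print(f"Natural accuracy: {natural_accuracy:.1f}%")
```

On these toy labels, a classifier only adds value once its accuracy exceeds the share of non-fraud cases, which is why this number is worth computing before training anything.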
This exercise is part of the course
Fraud Detection in Python
Exercise instructions
- Count the total number of observations by taking the length of your labels y.
- Count the non-fraud cases in our data by using a list comprehension on y; remember that y is a NumPy array, so .value_counts() cannot be used in this case.
- Calculate the natural accuracy by dividing the number of non-fraud cases by the total number of observations.
- Print the percentage.
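The instructions point out that .value_counts() is a pandas method and is not available on a NumPy array. As a side note, the following sketch shows two NumPy-native ways to get the same counts. The array y_toy is a hypothetical stand-in for y, and this is an alternative to the list comprehension the exercise asks for, not the expected solution.

```python
import numpy as np

# Hypothetical toy labels standing in for the exercise's y array.
y_toy = np.array([0, 0, 0, 1, 0, 0, 1, 0])

# np.unique with return_counts=True returns each distinct label and its count,
# which is roughly what value_counts() would report on a pandas Series.
labels, counts = np.unique(y_toy, return_counts=True)
class_counts = dict(zip(labels.tolist(), counts.tolist()))

# Comparing against 0 yields a boolean array; count_nonzero counts the True entries.
non_fraud_count = int(np.count_nonzero(y_toy == 0))

# Share of non-fraud cases, as a percentage.
natural_accuracy = 100 * non_fraud_count / len(y_toy)

print(class_counts)
print(natural_accuracy)
```

Both approaches avoid a Python-level loop, which can matter on large arrays, though for this exercise the list comprehension is perfectly adequate.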
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Count the total number of observations from the length of y
total_obs = ____
# Count the total number of non-fraudulent observations
non_fraud = [i for ____ ____ ____ if i == 0]
count_non_fraud = non_fraud.count(0)
# Calculate the percentage of non fraud observations in the dataset
percentage = (float(____)/float(____)) * 100
# Print the percentage: this is our "natural accuracy" by doing nothing
____(____)