Natural hit rate
In this exercise, you'll again use credit card transaction data. The features and labels are similar to the data in the previous chapter, and the data is heavily imbalanced. We've given you features X and labels y to work with already, which are both NumPy arrays.
First, you need to explore how prevalent fraud is in the dataset to understand the "natural accuracy", that is, the accuracy you would get by predicting every transaction as non-fraud. It's important to know which level of accuracy you need to beat in order to get a better prediction than by doing nothing. In the following exercises, you'll create your first random forest classifier for fraud detection. That will serve as the "baseline" model that you'll try to improve in the upcoming exercises.
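To illustrate the idea, here is a minimal sketch on a small made-up label array (y_toy is hypothetical, not the course data). It assumes labels are coded 0 for non-fraud and 1 for fraud, and shows that a "model" predicting non-fraud for every transaction scores an accuracy equal to the share of non-fraud cases.

```python
import numpy as np

# Hypothetical labels for illustration only: 0 = non-fraud, 1 = fraud
y_toy = np.array([0] * 98 + [1] * 2)

# A "do nothing" model that predicts non-fraud for every transaction
y_pred = np.zeros_like(y_toy)

# Accuracy is the fraction of predictions that match the labels
natural_accuracy = np.mean(y_pred == y_toy) * 100

# Share of non-fraud labels, computed directly from the labels
non_fraud_share = np.mean(y_toy == 0) * 100

print(natural_accuracy, non_fraud_share)
```

The two numbers coincide, which is why a high accuracy on heavily imbalanced data says little on its own.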
This exercise is part of the course
Fraud Detection in Python
Exercise instructions
- Count the total number of observations by taking the length of your labels y.
- Count the non-fraud cases in the data by using a list comprehension on y; remember that y is a NumPy array, so .value_counts() cannot be used in this case.
- Calculate the natural accuracy by dividing the number of non-fraud cases by the total number of observations.
- Print the percentage.
Interactive exercise
Try this exercise by completing this sample code.
# Count the total number of observations from the length of y
total_obs = ____
# Count the total number of non-fraudulent observations
non_fraud = [i for ____ ____ ____ if i == 0]
count_non_fraud = non_fraud.count(0)
# Calculate the percentage of non-fraud observations in the dataset
percentage = (float(____)/float(____)) * 100
# Print the percentage: this is our "natural accuracy" by doing nothing
____(____)
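As an aside, the list comprehension in the template loops over the array element by element in Python. A vectorized count with np.count_nonzero should give the same number; the sketch below compares the two on a made-up array (y_toy is hypothetical and stands in for the exercise's y, again assuming 0 means non-fraud).

```python
import numpy as np

# Hypothetical labels for illustration only: 0 = non-fraud, 1 = fraud
y_toy = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

# List-comprehension count, in the style of the exercise
count_loop = len([label for label in y_toy if label == 0])

# Vectorized count: compare every element to 0, then count the True values
count_vectorized = int(np.count_nonzero(y_toy == 0))

# Percentage of non-fraud observations in the toy array
share_non_fraud = count_vectorized / len(y_toy) * 100

print(count_loop, count_vectorized, share_non_fraud)
```

Both approaches count the same elements; the vectorized form is usually preferable on large arrays, while the list comprehension matches what the exercise asks for.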