Plotting your data
From the previous exercise, we know that the ratio of fraud to non-fraud observations is very low. You can address this, for example by resampling the data, which is explained in the next video.
In this exercise, you'll look at the data and visualize the fraud to non-fraud ratio. Looking at your data first, before you make any changes to it, is always a good starting point in a fraud analysis.
Moreover, when talking to your colleagues, a picture often makes it very clear that we're dealing with heavily imbalanced data.
Let's create a plot to visualize the ratio of fraud to non-fraud data points in the dataset df.
The function prep_data() is already loaded in your workspace, as well as matplotlib.pyplot as plt.
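Before plotting, it can help to put a number on the imbalance. The following is a minimal sketch, not part of the exercise: the labels list is made up for illustration (0 for non-fraud, 1 for fraud) and does not come from the course dataset.

```python
from collections import Counter

# Hypothetical labels, invented for illustration: 0 = non-fraud, 1 = fraud
labels = [0] * 995 + [1] * 5

# Count how often each class occurs
counts = Counter(labels)

# Share of fraud cases among all observations
fraud_share = counts[1] / len(labels)

print(f"non-fraud: {counts[0]}, fraud: {counts[1]}")
print(f"fraud share: {fraud_share:.2%}")
```

With real data, the same counts could be taken from the label array y once it is available.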
This exercise is part of the course
Fraud Detection in Python
Exercise instructions
Define the plot_data(X, y) function that will nicely plot the given feature set X with labels y in a scatter plot. This has been done for you.
Use the function prep_data() on your dataset df to create feature set X and labels y.
Run the function plot_data() on your newly obtained X and y to visualize your results.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Define a function to create a scatter plot of our data and labels
def plot_data(X, y):
    plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", alpha=0.5, linewidth=0.15)
    plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", alpha=0.5, linewidth=0.15, c='r')
    plt.legend()
    return plt.show()
# Create X and y from the prep_data function
X, y = prep_data(____)
# Plot our data by running our plot data function on X and y
____(X, y)
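The expressions X[y == 0, 0] and X[y == 1, 1] in plot_data() use NumPy boolean-mask indexing: the comparison selects rows by class, and the second index picks a feature column. The sketch below illustrates this on a tiny made-up array. It assumes X and y are NumPy arrays, which this style of indexing requires; the body of prep_data() is not shown in the exercise, so this is an inference rather than a confirmed detail.

```python
import numpy as np

# Toy feature matrix (4 observations, 2 features) and labels; values are made up
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])
y = np.array([0, 0, 1, 0])

# y == 1 is a boolean mask over the rows; the second index selects a column
fraud_x = X[y == 1, 0]      # first feature of the fraud rows
fraud_y = X[y == 1, 1]      # second feature of the fraud rows
nonfraud_x = X[y == 0, 0]   # first feature of the non-fraud rows

print(fraud_x, fraud_y, nonfraud_x)
```

Each plt.scatter call in plot_data() therefore receives the first two feature columns of one class, which is why the two classes can be drawn in different colors.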