Plotting your data
From the previous exercise, we know that the ratio of fraud to non-fraud observations is very low. You can address this, for example by resampling the data, which is explained in the next video.
In this exercise, you'll look at the data and visualize the fraud to non-fraud ratio. Looking at your data before you make any changes to it is always a good starting point in a fraud analysis.
Moreover, when talking to your colleagues, a picture often makes it very clear that we're dealing with heavily imbalanced data.
Let's create a plot to visualize the ratio of fraud to non-fraud data points in the dataset df.
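Before plotting, it can help to put a number on the imbalance. The following is a minimal sketch, assuming the labels are a 0/1 array with 1 marking fraud; the helper name class_ratio and the demo labels are hypothetical and not part of the course workspace.

```python
import numpy as np

def class_ratio(y):
    """Return (counts per class, share of class 1) for 0/1 integer labels."""
    y = np.asarray(y)
    counts = np.bincount(y, minlength=2)  # counts[0] = non-fraud, counts[1] = fraud
    fraud_share = counts[1] / counts.sum()
    return counts, fraud_share

# Hypothetical labels for illustration: 995 non-fraud and 5 fraud observations
y_demo = np.array([0] * 995 + [1] * 5)
counts, fraud_share = class_ratio(y_demo)
print(f"non-fraud: {counts[0]}, fraud: {counts[1]}, fraud share: {fraud_share:.2%}")
# prints: non-fraud: 995, fraud: 5, fraud share: 0.50%
```

A share this small is easy to overlook in a table, which is why the scatter plot in this exercise is a useful companion to the raw counts.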
The function prep_data() is already loaded in your workspace, as well as matplotlib.pyplot as plt.
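The body of prep_data() is not shown in this exercise. As a rough idea of what such a helper might do, here is a hypothetical stand-in that splits a DataFrame into a feature array and a label array. The function name prep_data_sketch, the label column name "Class", and the feature columns "V1" and "V2" are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def prep_data_sketch(df, label_col="Class"):
    """Hypothetical stand-in for prep_data(): return features X and labels y as NumPy arrays."""
    X = df.drop(columns=[label_col]).to_numpy(dtype=float)
    y = df[label_col].to_numpy(dtype=int)
    return X, y

# Tiny made-up dataset; the column names are assumptions, not the course data
demo = pd.DataFrame({"V1": [0.1, -1.2, 3.4],
                     "V2": [1.0, 0.3, -2.2],
                     "Class": [0, 0, 1]})
X_demo, y_demo = prep_data_sketch(demo)
print(X_demo.shape, y_demo)
# prints: (3, 2) [0 0 1]
```

Returning NumPy arrays rather than DataFrames matters here, because plot_data() selects rows and columns with expressions such as X[y == 0, 0], which is NumPy-style indexing.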
This exercise is part of the course
Fraud Detection in Python
Exercise instructions
Define the plot_data(X, y) function that will nicely plot the given feature set X with labels y in a scatter plot. This has been done for you.
Use the function prep_data() on your dataset df to create feature set X and labels y.
Run the function plot_data() on your newly obtained X and y to visualize your results.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Define a function to create a scatter plot of our data and labels
def plot_data(X, y):
    plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", alpha=0.5, linewidth=0.15)
    plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", alpha=0.5, linewidth=0.15, c='r')
    plt.legend()
    plt.show()

# Create X and y from the prep_data function
X, y = prep_data(____)

# Plot our data by running our plot data function on X and y
____(X, y)
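To see what this kind of plot looks like without access to the course workspace, the sketch below runs a variant of the plotting function on synthetic data. Everything here is an assumption for illustration: the name plot_data_sketch, the randomly generated points, and the class sizes. Unlike the exercise function, this variant returns the Axes and adds the class counts to the legend.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def plot_data_sketch(X, y):
    """Scatter the first two feature columns, colored by class, and return the Axes."""
    fig, ax = plt.subplots()
    for cls, color in ((0, "tab:blue"), (1, "r")):
        mask = y == cls
        ax.scatter(X[mask, 0], X[mask, 1], color=color, alpha=0.5, linewidth=0.15,
                   label=f"Class #{cls} (n={mask.sum()})")
    ax.legend()
    return ax

# Synthetic, heavily imbalanced data: 500 points in class 0 and 10 in class 1
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
                    rng.normal(2.5, 0.5, size=(10, 2))])
y_demo = np.array([0] * 500 + [1] * 10)
ax = plot_data_sketch(X_demo, y_demo)
```

In an interactive session you would drop the backend line and call plt.show(), or save the figure with ax.figure.savefig(...). The handful of red points among hundreds of blue ones is the picture of imbalance this exercise is after.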