Turning a heuristic into a classifier
You are surprised by how helpful heuristics can be, so you decide to treat the heuristic that "too many unique ports is suspicious" as a classifier in its own right. You do this by thresholding the number of unique ports per source at the average number used by bad source computers, that is, the computers for which the label is True. The dataset is preloaded and split into training and test sets, so you have the objects X_train, X_test, y_train and y_test in memory. Your imports include accuracy_score(), and numpy as np. To clarify: you won't be fitting a classifier from scikit-learn in this exercise; instead, you will define your own classification rule explicitly!
This exercise is part of the course
Designing Machine Learning Workflows in Python
Exercise instructions
- Subselect all bad hosts from X_train to form a new dataset X_train_bad. Note that y_train is a Boolean array.
- Calculate the average of the unique_ports column for bad hosts, and store it in avg_bad_ports.
- Now consider a classifier that predicts as positive every example whose unique_ports exceeds avg_bad_ports. Save the predictions of this classifier on the test data in a new variable, pred_port.
- Calculate this classifier's accuracy on the test data using accuracy_score().
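The first two instructions rely on indexing a DataFrame with a Boolean array, which can be sketched on a tiny made-up table. The names X_toy, y_toy, X_toy_bad and avg_toy_bad_ports, as well as the port counts, are hypothetical and are not part of the course dataset; the sketch also assumes the features are held in a pandas DataFrame and that the labels line up with its rows one to one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four source computers and their unique port counts.
X_toy = pd.DataFrame({"unique_ports": [2, 40, 3, 60]})

# Boolean labels, True marking a bad host (assumed aligned row by row with X_toy).
y_toy = np.array([False, True, False, True])

# Indexing a DataFrame with a Boolean array keeps only the rows where it is True.
X_toy_bad = X_toy[y_toy]

# Average port count among the bad hosts: (40 + 60) / 2.
avg_toy_bad_ports = np.mean(X_toy_bad["unique_ports"])
print(avg_toy_bad_ports)  # 50.0
```

If the labels were instead stored as a pandas Series with a different index from the features, pandas would align on index labels rather than on position, so it is worth checking that the two objects share the same row order.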
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a new dataset X_train_bad by subselecting bad hosts
X_train_bad = ____[____]
# Calculate the average of unique_ports in bad examples
avg_bad_ports = np.____(____['unique_ports'])
# Label as positive sources that use more ports than that
pred_port = ____['unique_ports'] > ____
# Print the accuracy of the heuristic
print(____(y_test, ____))
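To see how the whole rule behaves end to end, here is a minimal sketch that runs the same four steps on made-up data. The port counts and labels below are hypothetical stand-ins for the preloaded dataset, so the resulting accuracy says nothing about the real exercise; the sketch also assumes X_train and X_test are pandas DataFrames with a unique_ports column.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical toy data standing in for the preloaded course dataset.
X_train = pd.DataFrame({"unique_ports": [1, 2, 3, 10, 20]})
y_train = np.array([False, False, False, True, True])
X_test = pd.DataFrame({"unique_ports": [2, 4, 12, 30]})
y_test = np.array([False, False, True, True])

# Step 1: keep only the training rows labelled as bad hosts.
X_train_bad = X_train[y_train]

# Step 2: the threshold is the mean port count among bad hosts, (10 + 20) / 2.
avg_bad_ports = np.mean(X_train_bad["unique_ports"])

# Step 3: predict positive whenever a test source exceeds the threshold.
pred_port = X_test["unique_ports"] > avg_bad_ports

# Step 4: fraction of test examples where prediction and label agree.
accuracy = accuracy_score(y_test, pred_port)
print(accuracy)  # 0.75
```

In this toy example the bad host with 12 unique ports falls below the threshold of 15 and is missed. That illustrates a property of thresholding at the mean: bad hosts with below-average port counts will be classified as negative.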