Turning a heuristic into a classifier
Heuristics have turned out to be surprisingly helpful, so you decide to treat the heuristic "too many unique ports is suspicious" as a classifier in its own right. To do so, you threshold the number of unique ports per source at the average number used by bad source computers, that is, computers whose label is True. The dataset is preloaded and split into training and test sets, so the objects X_train, X_test, y_train and y_test are in memory. accuracy_score() has been imported, along with numpy as np. To clarify: you won't be fitting a scikit-learn classifier in this exercise; instead, you will define your own classification rule explicitly!
This exercise is part of the course
Designing Machine Learning Workflows in Python
Exercise instructions
- Subselect all bad hosts from X_train to form a new dataset X_train_bad. Note that y_train is a Boolean array.
- Calculate the average of the unique_ports column for bad hosts, and store it in avg_bad_ports.
- Now consider a classifier that predicts as positive every example whose unique_ports exceeds avg_bad_ports. Save the predictions of this classifier on the test data in a new variable, pred_port.
- Calculate this classifier's accuracy on the test data using accuracy_score().
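The first step relies on Boolean-mask indexing, which is worth seeing in isolation. The sketch below uses made-up toy data (X_toy and y_toy are hypothetical names, not the preloaded objects) and assumes the features live in a pandas DataFrame, which the column access by name in this exercise suggests but does not state.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the preloaded training set
X_toy = pd.DataFrame({"unique_ports": [2, 40, 3, 60, 5]})
y_toy = np.array([False, True, False, True, False])

# Indexing a DataFrame with a Boolean array of the same length
# keeps only the rows where the array is True (here: the bad hosts)
X_toy_bad = X_toy[y_toy]

print(X_toy_bad["unique_ports"].tolist())  # [40, 60]
```

Because the mask is a plain Boolean array, no comparison such as `y_toy == True` is needed.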
Interactive exercise
Complete the sample code to successfully finish this exercise.
# Create a new dataset X_train_bad by subselecting bad hosts
X_train_bad = ____[____]
# Calculate the average of unique_ports in bad examples
avg_bad_ports = np.____(____['unique_ports'])
# Label as positive sources that use more ports than that
pred_port = ____['unique_ports'] > ____
# Print the accuracy of the heuristic
print(____(y_test, ____))
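To see how such a threshold rule behaves end to end, here is a minimal sketch on made-up data. All names ending in _toy and the variable threshold are hypothetical and the numbers are invented for illustration; only np.mean and accuracy_score come from the exercise itself. It is not the exercise's data or its expected result.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical toy data; the real exercise preloads its own train/test split
X_train_toy = pd.DataFrame({"unique_ports": [1, 2, 30, 50, 3, 40]})
y_train_toy = np.array([False, False, True, True, False, True])
X_test_toy = pd.DataFrame({"unique_ports": [2, 45, 35, 4]})
y_test_toy = np.array([False, True, True, False])

# Threshold: mean number of unique ports among bad training hosts
# (30 + 50 + 40) / 3 = 40
threshold = np.mean(X_train_toy[y_train_toy]["unique_ports"])

# Predict positive for every test host that exceeds the threshold
pred_toy = X_test_toy["unique_ports"] > threshold

# Fraction of test hosts whose prediction matches the label
print(accuracy_score(y_test_toy, pred_toy))  # 0.75
```

In this toy split the bad host with 35 ports falls below the mean of 40 and is missed, which shows one consequence of thresholding at the average of the bad class: bad hosts with below-average port counts are labelled negative.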