Exercise

Turning a heuristic into a classifier

You are surprised by how helpful heuristics can be, so you decide to treat the heuristic that "too many unique ports is suspicious" as a classifier in its own right. You do this by thresholding the number of unique ports per source at the average number used by bad source computers, that is, computers for which the label is True. The dataset is preloaded and split into training and test sets, so you have the objects X_train, X_test, y_train and y_test in memory. accuracy_score() has been imported for you, along with numpy as np. To clarify: you won't be fitting a classifier from scikit-learn in this exercise; instead, you will define your own classification rule explicitly!

Instructions
  • Subselect all bad hosts from X_train to form a new dataset X_train_bad. Note that y_train is a Boolean array.
  • Calculate the average of the unique_ports column for bad hosts, and store it in avg_bad_ports.
  • Now consider a classifier that predicts as positive every example whose unique_ports value exceeds avg_bad_ports. Save the predictions of this classifier on the test data in a new variable, pred_port.
  • Calculate this classifier's accuracy on the test data using accuracy_score().
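
The four steps above can be sketched as follows. This is a minimal illustration, not the exercise's reference solution: the real X_train, X_test, y_train and y_test are preloaded in the exercise, so the tiny DataFrames and label arrays below are made-up stand-ins, and it is an assumption that the features are stored in a pandas DataFrame with a unique_ports column.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical stand-in data; in the exercise these objects are preloaded.
X_train = pd.DataFrame({"unique_ports": [2, 3, 40, 60, 5, 50]})
y_train = np.array([False, False, True, True, False, True])
X_test = pd.DataFrame({"unique_ports": [4, 55, 45, 70]})
y_test = np.array([False, True, True, True])

# Step 1: keep only the rows whose label is True (y_train is a Boolean mask).
X_train_bad = X_train[y_train]

# Step 2: average number of unique ports among the bad hosts.
avg_bad_ports = np.mean(X_train_bad["unique_ports"])

# Step 3: predict positive whenever unique_ports exceeds that average.
pred_port = X_test["unique_ports"] > avg_bad_ports

# Step 4: fraction of test examples where the rule matches the label.
accuracy = accuracy_score(y_test, pred_port)
print(avg_bad_ports, accuracy)  # 50.0 0.75 on the stand-in data
```

Note that the threshold is computed from the training data only, and the test data is used solely to evaluate the rule, just as it would be for a fitted classifier.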