
Turning a heuristic into a classifier

You are surprised by how helpful heuristics can be, so you decide to treat the heuristic that "too many unique ports is suspicious" as a classifier in its own right. To do so, you threshold the number of unique ports per source at the average number used by bad source computers, that is, the computers whose label is True. The dataset is preloaded and split into training and test sets, so the objects X_train, X_test, y_train and y_test are in memory. Your imports include accuracy_score() and numpy as np. To clarify: you won't be fitting a scikit-learn classifier in this exercise; instead, you will define your own classification rule explicitly!

This exercise is part of the course

Designing Machine Learning Workflows in Python


Exercise instructions

  • Subselect all bad hosts from X_train to form a new dataset X_train_bad. Note that y_train is a Boolean array.
  • Calculate the average of the unique_ports column for bad hosts, and store it in avg_bad_ports.
  • Now consider a classifier that predicts as positive every example whose unique_ports value exceeds avg_bad_ports. Save this classifier's predictions on the test data in a new variable, pred_port.
  • Calculate this classifier's accuracy on the test data using accuracy_score().
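
The steps above can be tried out on a small made-up dataset. The sketch below is only an illustration: the port counts and labels are invented stand-ins for the preloaded X_train, X_test, y_train and y_test, and it assumes that the X objects are pandas DataFrames with a unique_ports column.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Toy stand-ins for the preloaded objects (values are invented for illustration)
X_train = pd.DataFrame({"unique_ports": [2, 3, 40, 60, 4, 50]})
y_train = np.array([False, False, True, True, False, True])
X_test = pd.DataFrame({"unique_ports": [1, 55, 45, 3]})
y_test = np.array([False, True, True, False])

# A Boolean array used as an index keeps only the rows where it is True
X_train_bad = X_train[y_train]

# Average number of unique ports among the bad hosts in the training data
avg_bad_ports = np.mean(X_train_bad["unique_ports"])

# The rule itself: flag a test source when it uses more ports than that average
pred_port = X_test["unique_ports"] > avg_bad_ports

# Fraction of test sources for which the rule agrees with the true label
acc = accuracy_score(y_test, pred_port)
print(avg_bad_ports, acc)
```

On this toy data the threshold is 50 ports, so the bad test host with 45 ports goes unflagged. A rule that thresholds at the mean of the bad hosts will tend to miss bad hosts that sit below that mean.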

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create a new dataset X_train_bad by subselecting bad hosts
X_train_bad = ____[____]

# Calculate the average of unique_ports in bad examples
avg_bad_ports = np.____(____['unique_ports'])

# Label as positive sources that use more ports than that
pred_port = ____['unique_ports'] > ____

# Print the accuracy of the heuristic
print(____(y_test, ____))