Combining heuristics

A different cyber analyst tells you that during certain types of attack, the infected source computer sends small bits of traffic, to avoid detection. This makes you wonder whether it would be better to create a combined heuristic that simultaneously looks for large numbers of ports and small packet sizes. Does this improve performance over the simple port heuristic? As with the last exercise, you have X_train, X_test, y_train and y_test in memory. The sample code also helps you reproduce the outcome of the port heuristic, pred_port. You also have numpy as np and accuracy_score() preloaded.

Cet exercice fait partie du cours

Designing Machine Learning Workflows in Python

Afficher le cours

Instructions

The column average_packet computes the average packet size over all flows observed from a single source. Take the mean of those values for bad sources only on the training set.
Now construct a new rule which flags as positive all sources whose average traffic is less than the value above.
Combine the rules so that both heuristics have to simultaneously apply, using an appropriate arithmetic operation.
Report the accuracy of the combined heuristic.

Exercice interactif pratique

Essayez cet exercice en complétant cet exemple de code.

# Compute the mean of average_packet for bad sources
avg_bad_packet = np.mean(____[____]['average_packet'])

# Label as positive if average_packet is lower than that
pred_packet = ____[____] < avg_bad_packet

# Find indices where pred_port and pred_packet both True
pred_port = X_test['unique_ports'] > avg_bad_ports
pred_both = pred_packet ____ pred_port

# Ports only produced an accuracy of 0.919. Is this better?
print(accuracy_score(____, ____))

Modifier et exécuter le code

Designing Machine Learning Workflows in Python

AvancéNiveau de compétence

4.8+

65 reviews

In the previous chapters you established a solid foundation in supervised learning, complete with knowledge of deploying models in production but always assumed you a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data without any, or with very few, labels. This takes you into a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. Upon completing this chapter, you will clearly stand out from the crowd of data scientists in confidently knowing what tools to use to modify your workflow in order to overcome common real-world challenges.

Exercise 1: Anomaly detection Exercise 2: A simple outlier Exercise 3: LoF contamination Exercise 4: Novelty detection Exercise 5: A simple novelty Exercise 6: Three novelty detectors Exercise 7: Contamination revisited Exercise 8: Distance-based learning Exercise 9: Find the neighbor Exercise 10: Not all metrics agree Exercise 11: Unstructured data Exercise 12: Restricted Levenshtein Exercise 13: Bringing it all together Exercise 14: Concluding remarks