Is the source or the destination bad?
In the previous lesson, you used the destination computer as your entity of interest. However, your cybersecurity analyst just told you that the infected machines are the ones generating the bad traffic, so they will appear as a source, not a destination, in the flows dataset.
The flows dataset has been preloaded, along with the list bads of infected IDs and the feature extractor featurize() from the previous lesson. You also have numpy available as np, as well as AdaBoostClassifier() and cross_val_score().
This exercise is part of the course
Designing Machine Learning Workflows in Python
Exercise instructions
- Create a data frame where each row is a feature vector for a source_computer. Group by source computer ID in the flows dataset and apply the feature extractor to each group.
- Convert the iterator to a data frame by calling list() on it.
- Create labels by checking whether each source_computer ID belongs to the list bads you have been given.
- Assess an AdaBoostClassifier() on this data using cross_val_score().
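The grouping step in the instructions above can be sketched on a toy table. This is only an illustration: the featurize() below is a hypothetical stand-in for the extractor from the previous lesson, and the column names destination_port, packet_count, and duration are assumptions rather than the confirmed schema of flows.

```python
import numpy as np
import pandas as pd


# Hypothetical stand-in for the course's featurize(); the real one may differ.
def featurize(df):
    return {
        "unique_ports": len(set(df["destination_port"])),
        "average_packet": np.mean(df["packet_count"]),
        "average_duration": np.mean(df["duration"]),
    }


# Toy flows table (assumed columns): two source computers, three flows.
flows = pd.DataFrame({
    "source_computer": ["C1", "C1", "C2"],
    "destination_port": [80, 443, 22],
    "packet_count": [10, 20, 5],
    "duration": [1.0, 3.0, 2.0],
})

# apply() returns a Series holding one dict per source computer.
out = flows.groupby("source_computer").apply(featurize)

# list() turns that Series into a list of dicts, which pd.DataFrame expands
# into one column per feature; reusing out.index keeps the computer IDs.
X = pd.DataFrame(list(out), index=out.index)
print(X)
```

Because each group yields a dict rather than a row, the intermediate result is a Series of dicts, which is why the list() call is needed before building the data frame.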
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Group by source computer, and apply the feature extractor
out = flows.____('source_computer').____(featurize)
# Convert the iterator to a dataframe by calling list on it
X = pd.DataFrame(____, index=____)
# Check which sources in X.index are bad to create labels
y = [x in bads for x in ____]
# Report the average accuracy of Adaboost over 3-fold CV
print(np.mean(____(____, X, y)))
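
One possible way to fill in the blanks is sketched below on synthetic data, since the real flows and bads are only available inside the exercise environment. The featurize() here is a hypothetical stand-in, the column names and the simulated port-scanning behaviour of infected machines are assumptions, and the resulting accuracy says nothing about the real dataset. Note that cv=3 is passed explicitly: recent scikit-learn releases default to 5 folds, so the bare call from the template would not give 3-fold CV there.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score


# Hypothetical stand-in for the course's featurize(); the real one may differ.
def featurize(df):
    return {
        "unique_ports": len(set(df["destination_port"])),
        "average_packet": np.mean(df["packet_count"]),
        "average_duration": np.mean(df["duration"]),
    }


# Synthetic data (assumption): 60 computers, the first 20 are "infected"
# and contact a much wider range of destination ports.
rng = np.random.default_rng(0)
computers = [f"C{i}" for i in range(60)]
bads = computers[:20]

rows = []
for comp in computers:
    port_pool = 200 if comp in bads else 5
    for _ in range(30):
        rows.append({
            "source_computer": comp,
            "destination_port": int(rng.integers(0, port_pool)),
            "packet_count": int(rng.integers(1, 100)),
            "duration": float(rng.uniform(0.0, 10.0)),
        })
flows = pd.DataFrame(rows)

# Group by source computer, and apply the feature extractor
out = flows.groupby("source_computer").apply(featurize)

# Convert the Series of dicts to a data frame by calling list on it
X = pd.DataFrame(list(out), index=out.index)

# Check which sources in X.index are bad to create labels
y = [x in bads for x in X.index]

# Report the average accuracy of AdaBoost over 3-fold CV
scores = cross_val_score(AdaBoostClassifier(), X, y, cv=3)
mean_accuracy = np.mean(scores)
print(mean_accuracy)
```

The labels are built from X.index rather than from the raw flows rows, so there is exactly one label per feature vector.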