Is the source or the destination bad?

In the previous lesson, you used the destination computer as your entity of interest. However, your cybersecurity analyst just told you that the infected machines are the ones generating the bad traffic, so they will appear as a source, not a destination, in the flows dataset.

The flows dataset has been preloaded, as well as the list bads of infected IDs and the feature extractor featurize() from the previous lesson. You also have numpy available as np, AdaBoostClassifier(), and cross_val_score().

This exercise is part of the course

Designing Machine Learning Workflows in Python

Exercise instructions

  • Create a data frame where each row is a feature vector for a source_computer. Group by source computer ID in the flows dataset and apply the feature extractor to each group.
  • Convert the iterator to a data frame by calling list() on it.
  • Create labels by checking whether each source_computer ID appears in the list bads you have been given.
  • Assess an AdaBoostClassifier() on this data using cross_val_score().
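
To make the first three steps concrete, here is a minimal sketch on toy data. The flows frame, its destination_computer and byte_count columns, the contents of bads, and the body of featurize() are all assumptions invented for illustration; the real dataset and feature extractor from the previous lesson may look different.

```python
import pandas as pd

# Toy stand-in for the flows dataset. Every column except
# source_computer is an assumption made for this sketch.
flows = pd.DataFrame({
    "source_computer": ["C1", "C1", "C2", "C3", "C3", "C3"],
    "destination_computer": ["C9", "C8", "C9", "C7", "C7", "C9"],
    "byte_count": [100, 300, 50, 10, 20, 30],
})

# Hypothetical list of infected machine IDs.
bads = ["C3"]


def featurize(df):
    """Hypothetical feature extractor: returns one dict of features per group."""
    return {
        "n_flows": len(df),
        "unique_destinations": df["destination_computer"].nunique(),
        "mean_bytes": df["byte_count"].mean(),
    }


# One feature dict per source computer; the result is a Series indexed by ID.
out = flows.groupby("source_computer").apply(featurize)

# Expand the dicts into columns, keeping the source IDs as the row index.
X = pd.DataFrame(list(out), index=out.index)

# Label a row True when its source ID is in the infected list.
y = [x in bads for x in X.index]

print(X)
print(y)
```

Because the groups are keyed by source_computer, X.index holds the source IDs, which is what lets the labels line up row by row with the feature vectors.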

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Group by source computer, and apply the feature extractor
out = flows.____('source_computer').____(featurize)

# Convert the iterator to a dataframe by calling list on it
X = pd.DataFrame(____, index=____)

# Check which sources in X.index are bad to create labels
y = [x in bads for x in ____]

# Report the average accuracy of AdaBoost over 3-fold CV
print(np.mean(____(____, X, y)))
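
The evaluation step on its own can be sketched as follows. This is not the exercise solution: the feature matrix here is synthetic data from make_classification, standing in for the per-source features, and the random_state values are arbitrary choices for repeatability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the per-source feature matrix and labels.
X, y = make_classification(n_samples=150, n_features=5, random_state=0)

# cross_val_score fits a fresh clone of the estimator on each fold.
# For classifiers the default metric is accuracy.
scores = cross_val_score(AdaBoostClassifier(random_state=0), X, y, cv=3)

print(scores)           # one accuracy value per fold
print(np.mean(scores))  # average accuracy over the 3 folds
```

Note that cv=3 is passed explicitly here. Recent scikit-learn releases default to 5 folds when cv is omitted, so a call without cv would not be a 3-fold evaluation in those versions.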