Exercise

Feature engineering on grouped data

You will now build on the previous exercise by adding one more feature: the number of unique protocols used by each source computer. With grouped data, features can always be constructed this way: as a starting point, take the number of unique values of each categorical column and the mean of each numeric column. As before, flows is preloaded, along with cross_val_score() for measuring accuracy, AdaBoostClassifier(), pandas as pd, and numpy as np.
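
The general recipe described above (unique counts for categorical columns, means for numeric columns) can be sketched as follows. This is only an illustration on a tiny made-up table: the column names source_computer, destination_port and duration, and all the values, are assumptions and may not match the real flows data.

```python
import pandas as pd

# Hypothetical stand-in for flows; column names and values are assumptions.
flows = pd.DataFrame({
    "source_computer": ["C1", "C1", "C1", "C2", "C2", "C3"],
    "protocol": [6, 17, 6, 6, 6, 1],
    "destination_port": ["80", "53", "443", "80", "80", "N1"],
    "duration": [10.0, 2.0, 6.0, 4.0, 8.0, 1.0],
})

# Group the flows by the computer that originated them.
grouped = flows.groupby("source_computer")

# Which columns are treated as categorical or numeric is a modelling choice.
categorical_cols = ["protocol", "destination_port"]
numeric_cols = ["duration"]

# One row per source computer: unique-value counts and means.
unique_counts = grouped[categorical_cols].nunique().add_suffix("_nunique")
means = grouped[numeric_cols].mean().add_suffix("_mean")

# Both results share the same index, so they can be placed side by side.
features = pd.concat([unique_counts, means], axis=1)
print(features)
```

Note that protocol is stored as a number here but is still treated as categorical, since averaging protocol identifiers would not be meaningful.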

Instructions
100 XP
  • Apply a lambda function to the group iterator provided to compute the number of unique protocols used by each source computer. You can use set() to reduce the protocol column to a set of unique values.
  • Convert the result to a data frame with the right shape by providing an index and naming the column protocol.
  • Concatenate the new data frame with the old one, which is available as X.
  • Assess the accuracy of AdaBoostClassifier() on this new dataset using cross_val_score().