
Feature engineering on grouped data

You will now build on the previous exercise by adding one more feature: the number of unique protocols used by each source computer. With grouped data, features of this kind are always available as a starting point: within each group, you can take the number of unique values of every categorical column and the mean of every numeric column. As before, flows is preloaded, along with cross_val_score() for measuring accuracy, AdaBoostClassifier(), pandas as pd, and numpy as np.
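The general recipe described above (unique counts for categorical columns, means for numeric columns) might be sketched as follows. This is only an illustration under assumptions: flows_toy is a small made-up stand-in for flows, its column names other than source_computer and protocol are hypothetical, and baseline_group_features is a helper name invented here, not part of the course or of pandas.

```python
import pandas as pd

# Hypothetical toy stand-in for the preloaded `flows` data frame.
flows_toy = pd.DataFrame({
    "source_computer": ["C1", "C1", "C1", "C2", "C2"],
    "protocol": ["TCP", "UDP", "TCP", "TCP", "TCP"],
    "destination_computer": ["C5", "C6", "C5", "C7", "C7"],
    "byte_count": [100.0, 300.0, 200.0, 50.0, 150.0],
})

def baseline_group_features(df, key):
    """Illustrative helper: one row per group, with the number of unique
    values for each categorical column and the mean for each numeric one."""
    grouped = df.groupby(key)
    value_cols = df.columns.drop(key)
    numeric_cols = [c for c in value_cols
                    if pd.api.types.is_numeric_dtype(df[c])]
    categorical_cols = [c for c in value_cols if c not in numeric_cols]

    parts = []
    if categorical_cols:
        parts.append(grouped[categorical_cols].nunique().add_suffix("_nunique"))
    if numeric_cols:
        parts.append(grouped[numeric_cols].mean().add_suffix("_mean"))
    return pd.concat(parts, axis=1)

features = baseline_group_features(flows_toy, "source_computer")
print(features)
```

The result is indexed by the grouping key, so it can be joined with any other per-source feature table that shares that index.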

This exercise is part of the course

Designing Machine Learning Workflows in Python

Exercise instructions

  • Apply a lambda function on the group iterator provided, to compute the number of unique protocols used by each source computer. You can use set() to reduce the protocol column to a set of unique values.
  • Convert the result to a data frame with the right shape by providing an index and naming the column protocol.
  • Concatenate the new data frame with the old one, which is available as X.
  • Assess the accuracy of AdaBoostClassifier() on this new dataset using cross_val_score().
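The four steps above could look roughly like this on synthetic data. It is a sketch under stated assumptions, not the course's reference solution: flows_toy, X and y are generated here purely for illustration (the real ones are preloaded in the exercise), the byte_count column is hypothetical, and the protocol column is selected before calling apply() so that the lambda receives a Series rather than the whole group.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 source computers with 10 flows each.
sources = [f"C{i}" for i in range(20)]
labels = {s: i % 2 for i, s in enumerate(sources)}  # made-up labels
source_col = np.repeat(sources, 10)
flows_toy = pd.DataFrame({
    "source_computer": source_col,
    "protocol": rng.choice(["TCP", "UDP", "ICMP"], size=len(source_col)),
    # Labelled sources send larger flows, so the classifier has some signal.
    "byte_count": rng.integers(40, 1500, size=len(source_col))
                  + 2000 * np.array([labels[s] for s in source_col]),
})

# Stand-ins for the preloaded X and y, both indexed by source computer.
X = flows_toy.groupby("source_computer")[["byte_count"]].mean()
y = pd.Series(labels).loc[X.index]

# Step 1: number of unique protocols per source computer.
protocols = flows_toy.groupby("source_computer")["protocol"].apply(
    lambda s: len(set(s)))

# Step 2: one-column data frame with an explicit index and column name.
protocols_DF = pd.DataFrame(
    protocols.values, index=protocols.index, columns=["protocol"])

# Step 3: append the new feature column-wise (axis=1).
X_more = pd.concat([X, protocols_DF], axis=1)

# Step 4: cross-validated accuracy on the extended feature set.
scores = cross_val_score(AdaBoostClassifier(), X_more, y)
print(np.mean(scores))
```

Concatenating along axis=1 relies on X and protocols_DF sharing the same index, which holds here because both come from grouping on source_computer.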

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create a feature counting unique protocols per source
protocols = flows.groupby('source_computer').apply(
  lambda df: ____)

# Convert this feature into a dataframe, naming the column
protocols_DF = pd.DataFrame(
  protocols, index=____, columns=____)

# Now concatenate this feature with the previous dataset, X
X_more = pd.concat([X, ____], axis=____)

# Refit the classifier and report its accuracy
print(____(____(
  AdaBoostClassifier(), ____, y)))