Feature engineering on grouped data
You will now build on the previous exercise by considering one additional feature: the number of unique protocols used by each source computer. Note that with grouped data, it is always possible to construct features in this manner: you can take the number of unique elements of each categorical column and the mean of each numeric column as your starting point. As before, you have flows preloaded, along with cross_val_score() for measuring accuracy, AdaBoostClassifier(), pandas as pd, and numpy as np.
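The general recipe described above (unique counts for categorical columns, means for numeric columns) can be sketched on a small toy table. This is only an illustration under stated assumptions: the toy data frame, its destination_port and duration columns, and all values are hypothetical and are not taken from the course's flows dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for grouped flow records.
toy = pd.DataFrame({
    "source_computer": ["C1", "C1", "C1", "C2", "C2"],
    "protocol": ["tcp", "udp", "tcp", "tcp", "tcp"],
    "destination_port": ["80", "53", "443", "80", "80"],
    "duration": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Column roles are listed by hand here; this split is an assumption.
categorical_cols = ["protocol", "destination_port"]
numeric_cols = ["duration"]

grouped = toy.groupby("source_computer")

# One row per source computer: unique counts for the categorical
# columns, means for the numeric columns.
features = pd.concat(
    [grouped[categorical_cols].nunique(), grouped[numeric_cols].mean()],
    axis=1,
)
print(features)
```

Each row of features describes one group, so the result can serve as a per-source feature matrix for a classifier.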
This exercise is part of the course Designing Machine Learning Workflows in Python.
Exercise instructions
- Apply a lambda function on the group iterator provided, to compute the number of unique protocols used by each source computer. You can use set() to reduce the protocol column to a set of unique values.
- Convert the result to a data frame with the right shape by providing an index and naming the column protocol.
- Concatenate the new data frame with the old one, which is available as X.
- Assess the accuracy of AdaBoostClassifier() on this new dataset using cross_val_score().
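The four steps above can be tried out on synthetic data before working with the real dataset. The sketch below is not the exercise solution: toy_flows, X_prev, the n_flows feature and the labels y are all hypothetical stand-ins, and the lambda is applied to the selected protocol column rather than to the whole group data frame, which is a small variation assumed here for compatibility across pandas versions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical flows: 40 source computers, each appearing at least once.
computers = [f"C{i}" for i in range(40)]
sources = np.concatenate([computers, rng.choice(computers, size=360)])
toy_flows = pd.DataFrame({
    "source_computer": sources,
    "protocol": rng.choice(["tcp", "udp", "icmp"], size=len(sources)),
})

# Hypothetical "previous" feature set and labels, one row per computer.
X_prev = toy_flows.groupby("source_computer").size().to_frame("n_flows")
y = np.tile([0, 1], 20)  # arbitrary labels, for illustration only

# Step 1: number of unique protocols per source computer.
protocols = toy_flows.groupby("source_computer")["protocol"].apply(
    lambda s: len(set(s)))

# Step 2: a one-column data frame indexed by source computer.
protocols_df = pd.DataFrame(
    protocols.to_numpy(), index=protocols.index, columns=["protocol"])

# Step 3: append the new feature column-wise to the previous features.
X_more = pd.concat([X_prev, protocols_df], axis=1)

# Step 4: cross-validated accuracy with the default number of folds.
scores = cross_val_score(AdaBoostClassifier(), X_more, y)
print(np.mean(scores))
```

Because the labels here are arbitrary, the printed accuracy carries no meaning; the point is only the shape of the workflow, in which both data frames share the same source_computer index so that the concatenation along axis=1 lines the rows up.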
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a feature counting unique protocols per source
protocols = flows.groupby('source_computer').apply(
    lambda df: ____)

# Convert this feature into a dataframe, naming the column
protocols_DF = pd.DataFrame(
    protocols, index=____, columns=____)

# Now concatenate this feature with the previous dataset, X
X_more = pd.concat([X, ____], axis=____)

# Refit the classifier and report its accuracy
print(____(____(
    AdaBoostClassifier(), ____, y)))