Feature engineering on grouped data

You will now build on the previous exercise by adding one more feature: the number of unique protocols used by each source computer. With grouped data, features can always be constructed this way: as a starting point, take the number of unique values of each categorical column and the mean of each numeric column within each group. As before, flows is preloaded, along with cross_val_score() for measuring accuracy, AdaBoostClassifier(), pandas as pd, and numpy as np.
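The "unique counts for categorical columns, means for numeric columns" starting point can be sketched as follows. This is only an illustration on made-up data: flows_toy and its destination_computer and byte_count columns are assumptions, not the actual contents of the preloaded flows dataset.

```python
import pandas as pd

# Toy stand-in for the preloaded `flows` data. Only source_computer and
# protocol are named in the exercise; the other columns are hypothetical.
flows_toy = pd.DataFrame({
    "source_computer": ["C1", "C1", "C1", "C2", "C2", "C3"],
    "destination_computer": ["C2", "C3", "C2", "C1", "C1", "C1"],
    "protocol": ["6", "17", "6", "6", "6", "1"],
    "byte_count": [100, 300, 200, 50, 150, 10],
})

grouped = flows_toy.groupby("source_computer")

# Split the non-key columns into numeric and categorical ones.
feature_cols = flows_toy.columns.drop("source_computer")
numeric_cols = flows_toy[feature_cols].select_dtypes(include="number").columns
categorical_cols = feature_cols.difference(numeric_cols)

# One row per source computer: number of unique values for each
# categorical column, mean for each numeric column.
baseline = pd.concat(
    [grouped[list(categorical_cols)].nunique(),
     grouped[list(numeric_cols)].mean()],
    axis=1,
)
print(baseline)
```

Each row of baseline describes one group, so it can be used directly as a feature matrix once the labels are aligned to the same index.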

This exercise is part of the course

Designing Machine Learning Workflows in Python

Instructions

  • Apply a lambda function on the group iterator provided, to compute the number of unique protocols used by each source computer. You can use set() to reduce the protocol column to a set of unique values.
  • Convert the result to a data frame with the right shape by providing an index and naming the column protocol.
  • Concatenate the new data frame with the old one, which is available as X.
  • Assess the accuracy of AdaBoostClassifier() on this new dataset using cross_val_score().
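The four steps above can be sketched end to end on synthetic data. Everything suffixed with _toy below is an assumption made for illustration (including the byte_count column and the way labels are generated); it is not the course's flows, X, or y, and the score it prints says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic flows: 40 source computers, the second half labelled 1 and
# drawing from a wider pool of protocols (a made-up pattern).
rng = np.random.default_rng(0)
computers = [f"C{i}" for i in range(40)]
labels = [0] * 20 + [1] * 20
rows = []
for comp, label in zip(computers, labels):
    pool = ["6"] if label == 0 else ["1", "6", "17"]
    for _ in range(int(rng.integers(5, 10))):
        rows.append({
            "source_computer": comp,
            "protocol": str(rng.choice(pool)),
            "byte_count": int(rng.integers(10, 1000)),
        })
flows_toy = pd.DataFrame(rows)
grouped = flows_toy.groupby("source_computer")

# Hypothetical "previous" feature matrix: one numeric feature per source.
X_toy = grouped[["byte_count"]].mean()

# Step 1: number of unique protocols per source computer, using set().
# The protocol column is selected first, so the lambda receives a Series.
protocols = grouped["protocol"].apply(lambda s: len(set(s)))

# Step 2: one-column data frame, indexed by source computer.
protocols_DF = pd.DataFrame(
    protocols.values, index=protocols.index, columns=["protocol"])

# Step 3: concatenate column-wise with the existing features.
X_more = pd.concat([X_toy, protocols_DF], axis=1)

# Align the labels with the row order of the feature matrix.
y_toy = pd.Series(labels, index=computers).loc[X_more.index]

# Step 4: cross-validated accuracy of AdaBoost on the extended features.
scores = cross_val_score(AdaBoostClassifier(), X_more, y_toy)
print(np.mean(scores))
```

Concatenating with axis=1 relies on both data frames sharing the same index (the source computer), which is why the new feature is given an explicit index in step 2.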

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Create a feature counting unique protocols per source
protocols = flows.groupby('source_computer').apply(
  lambda df: ____)

# Convert this feature into a dataframe, naming the column
protocols_DF = pd.DataFrame(
  protocols, index=____, columns=____)

# Now concatenate this feature with the previous dataset, X
X_more = pd.concat([X, ____], axis=____)

# Refit the classifier and report its accuracy
print(____(____(
  AdaBoostClassifier(), ____, y)))