Feature engineering on grouped data

You will now build on the previous exercise, by considering one additional feature: the number of unique protocols used by each source computer. Note that with grouped data, it is always possible to construct features in this manner: you can take the number of unique elements of all categorical columns, and the mean of all numeric columns as your starting point. As before, you have flows preloaded, cross_val_score() for measuring accuracy, AdaBoostClassifier(), pandas as pd and numpy as np.

Cet exercice fait partie du cours

Designing Machine Learning Workflows in Python

Afficher le cours

Instructions

Apply a lambda function on the group iterator provided, to compute the number of unique protocols used by each source computer. You can use set() to reduce the protocol column to a set of unique values.
Convert the result to a data frame with the right shape by providing an index and naming the column protocol.
Concatenate the new data frame with the old one, which is available as X.
Assess the accuracy of AdaBoostClassifier() on this new dataset using cross_val_score().

Exercice interactif pratique

Essayez cet exercice en complétant cet exemple de code.

# Create a feature counting unique protocols per source
protocols = flows.groupby('source_computer').apply(
  lambda df: ____)

# Convert this feature into a dataframe, naming the column
protocols_DF = pd.DataFrame(
  protocols, index=____, columns=____)

# Now concatenate this feature with the previous dataset, X
X_more = pd.concat([X, ____], axis=____)

# Refit the classifier and report its accuracy
print(____(____(
  AdaBoostClassifier(), ____, y)))

Modifier et exécuter le code

Designing Machine Learning Workflows in Python

AvancéNiveau de compétence

4.8+

55 reviews

In the previous chapters you established a solid foundation in supervised learning, complete with knowledge of deploying models in production but always assumed you a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data without any, or with very few, labels. This takes you into a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. Upon completing this chapter, you will clearly stand out from the crowd of data scientists in confidently knowing what tools to use to modify your workflow in order to overcome common real-world challenges.

Exercise 1: Anomaly detection Exercise 2: A simple outlier Exercise 3: LoF contamination Exercise 4: Novelty detection Exercise 5: A simple novelty Exercise 6: Three novelty detectors Exercise 7: Contamination revisited Exercise 8: Distance-based learning Exercise 9: Find the neighbor Exercise 10: Not all metrics agree Exercise 11: Unstructured data Exercise 12: Restricted Levenshtein Exercise 13: Bringing it all together Exercise 14: Concluding remarks