Feature engineering on grouped data

You will now build on the previous exercise by adding one extra feature: the number of unique protocols used by each source computer. With grouped data, it is always possible to construct features this way: take the number of unique elements of every categorical column and the mean of every numeric column as a starting point. As before, flows is already loaded, and cross_val_score() for measuring accuracy, AdaBoostClassifier(), pandas as pd, and numpy as np are available.
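As a minimal sketch of this starting-point recipe, the aggregation could look like the following. It runs on a small toy table rather than the real flows data, and byte_count is a hypothetical numeric column introduced only for illustration; the exercise itself names only source_computer and protocol.

```python
import pandas as pd

# Toy stand-in for the flows data. byte_count is a hypothetical column.
flows = pd.DataFrame({
    "source_computer": ["C1", "C1", "C1", "C2", "C2", "C3"],
    "protocol": ["tcp", "udp", "tcp", "tcp", "tcp", "icmp"],
    "byte_count": [100.0, 300.0, 200.0, 50.0, 150.0, 10.0],
})

# List the column types explicitly instead of inferring them.
categorical_cols = ["protocol"]
numeric_cols = ["byte_count"]

# Number of unique values for categorical columns, mean for numeric ones.
agg_spec = {col: "nunique" for col in categorical_cols}
agg_spec.update({col: "mean" for col in numeric_cols})

# One row per source computer, one column per aggregated feature.
features = flows.groupby("source_computer").agg(agg_spec)
print(features)
```

Listing the categorical and numeric columns by hand keeps the sketch independent of how pandas infers dtypes; on a real dataset you might derive these lists from the column types instead.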

This exercise is part of the course

Designing Machine Learning Workflows in Python

Exercise instructions

Apply a lambda function to the provided group iterator to compute the number of unique protocols used by each source computer. You can use set() to reduce the protocol column to a set of unique values.
Convert the result to a data frame with the right shape by providing an index and naming the column protocol.
Concatenate the new data frame with the old one, which is available as X.
Assess the accuracy of AdaBoostClassifier() on this new dataset using cross_val_score().

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Create a feature counting unique protocols per source
protocols = flows.groupby('source_computer').apply(
  lambda df: ____)

# Convert this feature into a dataframe, naming the column
protocols_DF = pd.DataFrame(
  protocols, index=____, columns=____)

# Now concatenate this feature with the previous dataset, X
X_more = pd.concat([X, ____], axis=____)

# Refit the classifier and report its accuracy
print(____(____(
  AdaBoostClassifier(), ____, y)))
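One possible way to complete the steps in this template is sketched below on synthetic data. It is not the official solution: flows, X, and y are toy stand-ins, byte_count is a hypothetical column, and the labels are arbitrary, so the printed accuracy carries no meaning. The sketch also selects the protocol column before calling apply(), so the lambda receives a Series rather than a whole data frame, and it assumes the outer blank in the final step is np.mean.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy stand-in for flows: 40 hypothetical source computers, 1 to 5 flows each.
computers = [f"C{i}" for i in range(40)]
counts = rng.integers(1, 6, size=len(computers))
n_flows = int(counts.sum())
flows = pd.DataFrame({
    "source_computer": np.repeat(computers, counts),
    "protocol": rng.choice(["tcp", "udp", "icmp"], size=n_flows),
    "byte_count": rng.exponential(1000.0, size=n_flows),  # hypothetical column
})

# Toy stand-ins for the earlier feature table X and the labels y.
X = flows.groupby("source_computer")[["byte_count"]].mean()
y = pd.Series(np.arange(len(X)) % 2, index=X.index)  # arbitrary labels

# Create a feature counting unique protocols per source.
protocols = flows.groupby("source_computer")["protocol"].apply(
    lambda s: len(set(s)))

# Convert this feature into a data frame, naming the column.
protocols_DF = pd.DataFrame(
    protocols, index=protocols.index, columns=["protocol"])

# Concatenate column-wise; rows are aligned on the source_computer index.
X_more = pd.concat([X, protocols_DF], axis=1)

# Refit the classifier and report its mean cross-validated accuracy.
scores = cross_val_score(AdaBoostClassifier(), X_more, y)
print(np.mean(scores))
```

Using axis=1 in pd.concat() adds the new feature as a column; because both data frames are indexed by source computer, each count lands on the matching row.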

In the previous chapters, you established a solid foundation in supervised learning, including how to deploy models in production, but you always assumed that a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data with no labels, or with very few. This takes you on a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. Upon completing this chapter, you will know which tools to use to adapt your workflow to common real-world challenges.

Exercise 1: Anomaly detection
Exercise 2: A simple outlier
Exercise 3: LoF contamination
Exercise 4: Novelty detection
Exercise 5: A simple novelty
Exercise 6: Three novelty detectors
Exercise 7: Contamination revisited
Exercise 8: Distance-based learning
Exercise 9: Find the neighbor
Exercise 10: Not all metrics agree
Exercise 11: Unstructured data
Exercise 12: Restricted Levenshtein
Exercise 13: Bringing it all together
Exercise 14: Concluding remarks