Bringing it all together

You have just joined an arrhythmia detection startup and want to train a model on the arrhythmia dataset arrh. You have noticed that random forests often score well in Kaggle competitions, so you want to try one with a maximum depth of 2, 5, or 10, using grid search. You also see that the dimensionality of the dataset is quite high, so you want to consider the effect of a feature selection method.

To make sure you do not overfit by accident, you have already split your data. You will use X_train and y_train for the grid search, and X_test and y_test to decide whether feature selection helps. All four parts of the dataset have already been loaded for you. You also have access to GridSearchCV(), train_test_split(), SelectKBest(), chi2() and RandomForestClassifier as rfc.
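The exercise environment pre-loads the four splits, so the split itself is never shown. A minimal sketch of how such a split could be produced is given below. The real arrh data is not reproduced here: the synthetic dataset, its size, the 25% test fraction and the stratified split are all assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the arrhythmia data (an assumption, not the real `arrh`):
# a high-dimensional binary classification problem.
X, y = make_classification(
    n_samples=400, n_features=250, n_informative=20, random_state=1)

# chi2 only accepts non-negative features, so shift every column to start at 0.
X = X - X.min(axis=0)

# Hold out 25% of the rows; the held-out part is only used for the final comparison.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

print(X_train.shape, X_test.shape)
```

Keeping X_test and y_test out of the grid search means the later comparison with and without feature selection is made on data the tuned model has not seen.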

This exercise is part of the course

Designing Machine Learning Workflows in Python

Exercise instructions

Use grid search to experiment with a maximum depth of 2, 5, and 10 for RandomForestClassifier, and store the best-performing parameter setting.
Now refit the estimator using the best-performing maximum depth found above.
Apply the SelectKBest feature selector with the chi2 scoring function and refit the classifier.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Find the best value for max_depth among values 2, 5 and 10
grid_search = GridSearchCV(
  ____(random_state=1), param_grid=____)
best_value = grid_search.____(
  ____, ____).best_params_['max_depth']

# Using the best value from above, fit a random forest
clf = rfc(
  random_state=1, ____=best_value).____(X_train, y_train)

# Apply SelectKBest with chi2 and pick top 100 features
vt = SelectKBest(____, k=____).____(X_train, y_train)

# Create a new dataset only containing the selected features
X_train_reduced = ____.transform(____)
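One possible way to complete the template is sketched below, run end to end on synthetic data. The data generation, the accuracy_score comparison, and the names X_test_reduced, clf_reduced, acc_full and acc_reduced are assumptions added for illustration; they are not part of the exercise. The value k=100 follows the comment in the template.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the pre-loaded arrhythmia splits (assumption).
X, y = make_classification(
    n_samples=400, n_features=250, n_informative=20, random_state=1)
X = X - X.min(axis=0)  # chi2 requires non-negative feature values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# Find the best value for max_depth among values 2, 5 and 10
grid_search = GridSearchCV(
    rfc(random_state=1), param_grid={'max_depth': [2, 5, 10]})
best_value = grid_search.fit(
    X_train, y_train).best_params_['max_depth']

# Using the best value from above, fit a random forest on all features
clf = rfc(
    random_state=1, max_depth=best_value).fit(X_train, y_train)
acc_full = accuracy_score(y_test, clf.predict(X_test))

# Apply SelectKBest with chi2 and pick top 100 features
vt = SelectKBest(chi2, k=100).fit(X_train, y_train)

# Create new datasets only containing the selected features.
# The selector is fitted on the training data and merely applied to the test data.
X_train_reduced = vt.transform(X_train)
X_test_reduced = vt.transform(X_test)

# Refit with the same depth on the reduced features and compare on the test set
clf_reduced = rfc(
    random_state=1, max_depth=best_value).fit(X_train_reduced, y_train)
acc_reduced = accuracy_score(y_test, clf_reduced.predict(X_test_reduced))

print('max_depth:', best_value)
print('test accuracy, all features:', acc_full)
print('test accuracy, top 100 features:', acc_reduced)
```

Whether feature selection helps depends on the data, so the two accuracies should be read as a comparison for this particular split rather than as a general result.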

In the previous chapters you established a solid foundation in supervised learning, including how to deploy models in production, but you always assumed that a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data with few or no labels. This takes you on a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what makes two examples similar can be used in place of labels to help you reach accuracy comparable to a supervised workflow. After completing this chapter, you will know which tools to use to adapt your workflow to common real-world challenges.

Exercise 1: Anomaly detection
Exercise 2: A simple outlier
Exercise 3: LoF contamination
Exercise 4: Novelty detection
Exercise 5: A simple novelty
Exercise 6: Three novelty detectors
Exercise 7: Contamination revisited
Exercise 8: Distance-based learning
Exercise 9: Find the neighbor
Exercise 10: Not all metrics agree
Exercise 11: Unstructured data
Exercise 12: Restricted Levenshtein
Exercise 13: Bringing it all together
Exercise 14: Concluding remarks