Bringing it all together

You have just joined an arrhythmia detection startup and want to train a model on the arrhythmia dataset arrh. You have noticed that random forests often win Kaggle competitions, so you want to try them with a maximum depth of 2, 5, or 10, using a grid search. You also observe that the dimensionality of the dataset is quite high, and you want to evaluate the effect of a feature selection method.

To make sure you do not overfit by mistake, you have already split the data. You will use X_train and y_train for the grid search, and X_test and y_test to decide whether feature selection helps. All four splits of the dataset are preloaded in your environment. You also have access to GridSearchCV(), train_test_split(), SelectKBest(), chi2() and RandomForestClassifier as rfc.

This exercise is part of the course

Designing Machine Learning Workflows in Python

Exercise instructions

Use a grid search to try a maximum depth of 2, 5 and 10 for RandomForestClassifier, and store the best-performing parameter setting.
Now refit the estimator using the best maximum depth found above.
Apply the SelectKBest feature selector with the chi2 scoring function and refit the classifier on the selected features.

Hands-on interactive exercise

Try this exercise by completing the sample code.

# Find the best value for max_depth among values 2, 5 and 10
grid_search = GridSearchCV(
  ____(random_state=1), param_grid=____)
best_value = grid_search.____(
  ____, ____).best_params_['max_depth']

# Using the best value from above, fit a random forest
clf = rfc(
  random_state=1, ____=best_value).____(X_train, y_train)

# Apply SelectKBest with chi2 and pick top 100 features
vt = SelectKBest(____, k=____).____(X_train, y_train)

# Create a new dataset only containing the selected features
X_train_reduced = ____.transform(____)
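One possible way to fill in the blanks is sketched below. Because the arrh data is not available outside the exercise environment, the sketch builds a synthetic stand-in with `make_classification`; the sample count, feature count, and the `np.abs` step (chi2 needs non-negative inputs) are assumptions made only to keep the example self-contained. The final comparison on X_test and y_test, and the names `X_test_reduced`, `clf_reduced`, `acc_full` and `acc_reduced`, are not part of the template and are added here to illustrate how the held-out split could be used to judge feature selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the arrh data (assumption): 300 rows, 200 features.
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=20, random_state=1)
X = np.abs(X)  # chi2 requires non-negative feature values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Find the best value for max_depth among values 2, 5 and 10
grid_search = GridSearchCV(
    rfc(random_state=1), param_grid={'max_depth': [2, 5, 10]})
best_value = grid_search.fit(
    X_train, y_train).best_params_['max_depth']

# Using the best value from above, fit a random forest
clf = rfc(
    random_state=1, max_depth=best_value).fit(X_train, y_train)

# Apply SelectKBest with chi2 and pick top 100 features
vt = SelectKBest(chi2, k=100).fit(X_train, y_train)

# Create new datasets only containing the selected features
X_train_reduced = vt.transform(X_train)
X_test_reduced = vt.transform(X_test)

# Refit on the reduced features and compare held-out accuracy
clf_reduced = rfc(
    random_state=1, max_depth=best_value).fit(X_train_reduced, y_train)
acc_full = clf.score(X_test, y_test)
acc_reduced = clf_reduced.score(X_test_reduced, y_test)
print(f"max_depth={best_value}, "
      f"full={acc_full:.3f}, reduced={acc_reduced:.3f}")
```

The selector is fitted on the training split only and then applied to both splits, so the test data plays no part in choosing the features.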

In the previous chapters you established a solid foundation in supervised learning, including how to deploy models in production, but always assumed that a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data with very few labels, or none at all. This takes you on a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you reach accuracy comparable to a supervised workflow. After completing this chapter, you will know which tools to use to adapt your workflow to common real-world challenges.

Exercise 1: Anomaly detection
Exercise 2: A simple outlier
Exercise 3: LoF contamination
Exercise 4: Novelty detection
Exercise 5: A simple novelty
Exercise 6: Three novelty detectors
Exercise 7: Contamination revisited
Exercise 8: Distance-based learning
Exercise 9: Find the neighbor
Exercise 10: Not all metrics agree
Exercise 11: Unstructured data
Exercise 12: Restricted Levenshtein
Exercise 13: Bringing it all together
Exercise 14: Concluding remarks