Handling label noise

A cybersecurity analyst on your team tells you that many of the labels for the first 100 source computers in your training data may be wrong because of a database error. She hopes you can still use the data, since most of the labels are still correct, but asks you to treat these 100 labels as "noisy". Fortunately, you know how to handle this using weighted learning. The contaminated data are available in your workspace as X_train, X_test, y_train_noisy, y_test. You want to check whether weighted learning can improve the performance of a GaussianNB() classifier. You can use the optional parameter sample_weight, which is supported by the .fit() methods of most popular classifiers. The function accuracy_score() is preloaded. You can refer to the image below as a guide.
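To see why down-weighting helps, here is a minimal sketch of what a sample weight does to a per-class mean estimate, which is the kind of statistic a Gaussian model fits. The numbers are made up for illustration; the last value stands in for an example that landed in the wrong class because of a bad label.

```python
import numpy as np

# Hypothetical feature values for one class; the last one is a mislabeled example
values = np.array([1.0, 1.2, 0.8, 5.0])

# Equal weights: the mislabeled point pulls the mean up as much as any other point
mean_equal = np.average(values, weights=np.ones(4))

# Half weight on the suspect point: its influence on the estimate shrinks
mean_down = np.average(values, weights=np.array([1.0, 1.0, 1.0, 0.5]))

print(mean_equal, mean_down)
```

The down-weighted mean sits closer to the clean values, but the suspect point is not discarded, which matters here because most of the noisy labels are still correct.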

This exercise is part of the course

Designing Machine Learning Workflows in Python


Exercise instructions

Fit an instance of GaussianNB() to the training data with contaminated labels.
Report its accuracy on the test data using accuracy_score().
Create weights that assign twice as much weight to the ground truth labels as to the noisy ones. Remember that the weights refer to the training data.
Refit the classifier using the weights above and report its accuracy.

Hands-on interactive exercise

Try to solve this exercise by completing the sample code.

# Fit a Gaussian Naive Bayes classifier to the training data
clf = ____.____(____, y_train_noisy)

# Report its accuracy on the test data
print(accuracy_score(y_test, ____.____(X_test)))

# Assign half the weight to the first 100 noisy examples
weights = [____]*100 + [1.0]*(len(____)-100)

# Refit using weights and report accuracy. Has it improved?
clf_weights = GaussianNB().fit(X_train, y_train_noisy, ____=____)
print(accuracy_score(y_test, ____))
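One possible way to complete the template is sketched below. The workspace data are not available outside the exercise, so this version builds a synthetic stand-in: the make_classification dataset, the 40% flip rate, and the random seeds are all assumptions made for illustration, and the accuracies it prints say nothing about the course data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the workspace data (an assumption, not the course data)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Simulate the database error: flip roughly 40% of the first 100 training labels
rng = np.random.default_rng(0)
y_train_noisy = y_train.copy()
flip_idx = np.flatnonzero(rng.random(100) < 0.4)
y_train_noisy[flip_idx] = 1 - y_train_noisy[flip_idx]

# Fit a Gaussian Naive Bayes classifier to the noisy training data
clf = GaussianNB().fit(X_train, y_train_noisy)
acc_plain = accuracy_score(y_test, clf.predict(X_test))
print(acc_plain)

# Half weight for the first 100 (noisy) examples, full weight for the rest
weights = [0.5] * 100 + [1.0] * (len(y_train_noisy) - 100)

# Refit with sample weights and report accuracy again
clf_weights = GaussianNB().fit(X_train, y_train_noisy, sample_weight=weights)
acc_weighted = accuracy_score(y_test, clf_weights.predict(X_test))
print(acc_weighted)
```

Whether the weighted fit scores higher depends on the data and on how many of the 100 labels are actually wrong, so compare the two printed numbers rather than assuming an improvement.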


In the previous chapters you established a solid foundation in supervised learning, including how to deploy models in production, but always assumed that a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data with no labels, or with very few. This takes you on a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. By the end of this chapter, you will know which tools to use to adapt your workflow to common real-world challenges.

Exercise 1: Anomaly detection
Exercise 2: A simple outlier
Exercise 3: LoF contamination
Exercise 4: Novelty detection
Exercise 5: A simple novelty
Exercise 6: Three novelty detectors
Exercise 7: Contamination revisited
Exercise 8: Distance-based learning
Exercise 9: Find the neighbor
Exercise 10: Not all metrics agree
Exercise 11: Unstructured data
Exercise 12: Restricted Levenshtein
Exercise 13: Bringing it all together
Exercise 14: Concluding remarks