Dealing with label noise

One of your cyber analysts informs you that many of the labels for the first 100 source computers in your training data might be wrong because of a database error. She hopes you can still use the data because most of the labels are still correct, but asks you to treat these 100 labels as "noisy". Thankfully you know how to do that, using weighted learning. The contaminated data is available in your workspace as X_train, X_test, y_train_noisy, y_test. You want to see whether weighted learning can improve the performance of a GaussianNB() classifier. You can use the optional parameter sample_weight, which is supported by the .fit() methods of most popular classifiers. The function accuracy_score() is preloaded. You can consult the image below for guidance.

This exercise is part of the course

Designing Machine Learning Workflows in Python


Exercise instructions

Fit an instance of GaussianNB() to the training data with the contaminated labels.
Report its accuracy on the test data using accuracy_score().
Create weights that assign twice as much weight to ground-truth labels as to noisy labels. Remember that the weights refer to the training data.
Refit the classifier using the above weights and report its accuracy.
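To see why halving a weight matters, it can help to look at a single statistic. The sketch below is a toy illustration, not the course's code: the helper weighted_mean and the six feature values are hypothetical. It assumes the first two values come from mislabeled examples and shows that giving them weight 0.5 pulls the estimated mean back toward the trusted values. Weighted estimators apply the same idea to the quantities they fit.

```python
def weighted_mean(values, weights):
    """Mean in which each value contributes in proportion to its weight."""
    total = sum(w * v for v, w in zip(values, weights))
    return total / sum(weights)

# Hypothetical feature values for one class; the first two are assumed
# to belong to examples whose labels are wrong.
n_noisy = 2
values = [9.0, 9.0, 1.0, 1.0, 1.0, 1.0]

# Trusted examples get weight 1.0, noisy ones half of that.
weights = [0.5] * n_noisy + [1.0] * (len(values) - n_noisy)

unweighted = weighted_mean(values, [1.0] * len(values))
weighted = weighted_mean(values, weights)
print(unweighted, weighted)
```

Only the ratio between the weights matters here: using 1.0 and 2.0 instead of 0.5 and 1.0 gives the same weighted mean.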

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Fit a Gaussian Naive Bayes classifier to the training data
clf = ____.____(____, y_train_noisy)

# Report its accuracy on the test data
print(accuracy_score(y_test, ____.____(X_test)))

# Assign half the weight to the first 100 noisy examples
weights = [____]*100 + [1.0]*(len(____)-100)

# Refit using weights and report accuracy. Has it improved?
clf_weights = GaussianNB().fit(X_train, y_train_noisy, ____=____)
print(accuracy_score(y_test, ____))
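One way the completed workflow might look is sketched below. Because the workspace variables are not available outside the exercise, this version builds a synthetic stand-in with make_classification and simulates the database error by flipping half of the first 100 training labels. Both the dataset and the flipping scheme are assumptions made for illustration, so the printed accuracies say nothing about the course data, and the weighted fit is not guaranteed to score higher.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the workspace data (assumption, not the course dataset)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Simulate the database error: flip 50 of the first 100 training labels
rng = np.random.default_rng(0)
y_train_noisy = y_train.copy()
flipped = rng.choice(100, size=50, replace=False)
y_train_noisy[flipped] = 1 - y_train_noisy[flipped]

# Baseline: every training example counts equally
clf = GaussianNB().fit(X_train, y_train_noisy)
acc_unweighted = accuracy_score(y_test, clf.predict(X_test))

# Half the weight for the first 100 (noisy) examples, full weight for the rest
weights = [0.5] * 100 + [1.0] * (len(y_train_noisy) - 100)

# Refit with the weights passed through sample_weight
clf_weights = GaussianNB().fit(X_train, y_train_noisy, sample_weight=weights)
acc_weighted = accuracy_score(y_test, clf_weights.predict(X_test))

print(acc_unweighted, acc_weighted)
```

The weights list must have one entry per training example, in the same order as the rows of X_train, which is why its length is derived from the training labels rather than from the test set.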


In the previous chapters you established a solid foundation in supervised learning, complete with knowledge of deploying models in production, but always assumed that a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data without any, or with very few, labels. This takes you on a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. Upon completing this chapter, you will stand out from the crowd of data scientists by knowing which tools to use to modify your workflow and overcome common real-world challenges.

Exercise 1: Anomaly detection
Exercise 2: A simple outlier
Exercise 3: LoF contamination
Exercise 4: Novelty detection
Exercise 5: A simple novelty
Exercise 6: Three novelty detectors
Exercise 7: Contamination revisited
Exercise 8: Distance-based learning
Exercise 9: Find the neighbor
Exercise 10: Not all metrics agree
Exercise 11: Unstructured data
Exercise 12: Restricted Levenshtein
Exercise 13: Bringing it all together
Exercise 14: Concluding remarks