Exercise

Dealing with label noise

One of your cyber analysts informs you that many of the labels for the first 100 source computers in your training data might be wrong because of a database error. She hopes you can still use the data because most of the labels are still correct, but asks you to treat these 100 labels as "noisy". Thankfully, you know how to do that using weighted learning.

The contaminated data is available in your workspace as X_train, X_test, y_train_noisy, and y_test. You want to see whether weighted learning can improve the performance of a GaussianNB() classifier. To do so, you can use the optional parameter sample_weight, which is supported by the .fit() methods of most popular classifiers. The function accuracy_score() is preloaded. You can consult the image below for guidance.
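As a rough illustration of what sample_weight does, the sketch below fits GaussianNB() on a tiny made-up dataset (not the exercise data) and compares the fitted class mean with a weighted average computed by hand. The arrays X, y and w are hypothetical names chosen for this example only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data, made up for illustration: one feature, two classes.
X = np.array([[0.0], [1.0], [4.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# The third example is trusted less, so it gets half the weight of the others.
w = np.array([1.0, 1.0, 0.5, 1.0, 1.0, 1.0])

clf = GaussianNB().fit(X, y, sample_weight=w)

# For GaussianNB, the fitted per-class feature means (theta_) are
# weighted averages of the training examples in that class.
class0_mean = np.average(X[y == 0], axis=0, weights=w[y == 0])
print(clf.theta_[0], class0_mean)
```

A down-weighted example still contributes to the fit, but it pulls the estimated class statistics less than a fully trusted one.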

Instructions
100 XP
  • Fit an instance of GaussianNB() to the training data with the contaminated labels.
  • Report its accuracy on the test data using accuracy_score().
  • Create weights that assign twice as much weight to the ground truth labels as to the noisy labels. Remember that the weights concern the training data.
  • Refit the classifier using the above weights and report its accuracy.
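
The steps above can be sketched as follows. This is a minimal sketch, not the reference solution: the workspace data is not available here, so it builds a synthetic stand-in with make_classification and simulates the database error by flipping some of the first 100 training labels. It also assumes that the "first 100 source computers" correspond to the first 100 rows of the training data. With the real workspace variables, only the fitting and weighting part would be needed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the workspace data (assumption, not the real dataset).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Simulate the database error: flip roughly 40% of the first 100 training labels.
rng = np.random.default_rng(0)
y_train_noisy = y_train.copy()
flip = rng.random(100) < 0.4
y_train_noisy[:100] = np.where(flip, 1 - y_train[:100], y_train[:100])

# Step 1 and 2: fit on the contaminated labels and report test accuracy.
clf = GaussianNB().fit(X_train, y_train_noisy)
acc_unweighted = accuracy_score(y_test, clf.predict(X_test))
print("Unweighted accuracy:", acc_unweighted)

# Step 3: one weight per training example. Noisy labels (assumed to be the
# first 100 rows) get weight 1.0, the remaining trusted labels get weight 2.0.
weights = np.full(len(y_train_noisy), 2.0)
weights[:100] = 1.0

# Step 4: refit with the weights and report test accuracy again.
clf_weighted = GaussianNB().fit(X_train, y_train_noisy, sample_weight=weights)
acc_weighted = accuracy_score(y_test, clf_weighted.predict(X_test))
print("Weighted accuracy:", acc_weighted)
```

Only the ratio between the weights matters here, so [1.0, 2.0] and [0.5, 1.0] lead to the same fitted class statistics. Whether the weighted model scores higher depends on the data and on how many labels were actually corrupted.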