1. Labels, weak labels and truth
In the last lesson, you saw how to engineer a feature vector from complex, multi-source datasets. Now we focus on the labels instead. As you will see, real-world labels are sometimes imperfect, and studying these imperfections can help you build a better classifier.
2. Labels are not always perfect
In the real world, things are not black and white. Instead, there are degrees of truth.
At the top of the scale, we have "ground truth": labels that cannot be disputed. For example, a virus might crash the computer and demand ransom money. There is no doubt that this computer was infected.
Next down is expert opinion: a cyber analyst might inspect a computer and determine it was compromised. Expert labels are usually, but not always, right.
Finally, sometimes quick rules of thumb are available, also known as "heuristics": simple rules that are not very accurate but are better than nothing. For example, if a computer receives traffic to a very large number of ports in a very short time period, it might be under a portscan attack.
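As a minimal sketch of what such a heuristic could look like in code: the column names 'unique_ports' and 'duration_seconds', as well as the thresholds, are assumptions for illustration, not values from the lesson.

    def portscan_heuristic(flows, port_threshold=100, max_seconds=10):
        # Flag a record as a likely portscan if it touches many
        # distinct ports within a short time window.
        return (flows['unique_ports'] > port_threshold) & \
               (flows['duration_seconds'] < max_seconds)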
3. Labels are not always perfect
The quality of your labels acts as an upper bound on the real-world performance of your classifier, so it is important to be honest about how your labels were produced.
If your labels are produced from ground truth or expert opinion, you might be safe in treating them as perfectly accurate.
If instead your labels were produced using inaccurate heuristics, then such labels are often called "noisy", or "weak".
You might have to engineer certain custom features to implement these heuristics.
4. Features and heuristics
One example of a relevant feature that can also be used as a heuristic is the number of unique ports.
As you can see, its average value in bad traffic is 15.11. Here we used the Boolean vector of labels as a mask to select the bad rows.
In contrast, in background traffic it is around 11.23, significantly lower, which suggests this feature has diagnostic value.
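As a sketch, the comparison might look like this, assuming a DataFrame X of features with a 'unique_ports' column and a Boolean label vector y where True marks bad traffic; these names are assumptions.

    import numpy as np

    # y is a Boolean label vector, so it can mask the feature column directly.
    avg_bad = np.mean(X['unique_ports'][y])          # around 15.11 here
    avg_background = np.mean(X['unique_ports'][~y])  # around 11.23 here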
5. From features to labels
You could therefore apply a simple threshold to this feature to generate additional labels for the dataset.
These labels will not be as accurate as the expert labels, but they might still capture a different, useful signal.
Let's split our data into training and test sets, and label as "bad" any computer with more than 15 unique ports. This results in two labels for each example, which might disagree with each other.
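A sketch of both steps, reusing the hypothetical X and y from before:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y)

    # Weak labels from the heuristic: more than 15 unique ports means "bad".
    y_weak_train = X_train['unique_ports'] > 15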
6. From features to labels
One way to deal with multiple labels is to augment the data by stacking two copies of the features on top of each other, labeling one copy with the ground truth and the other with the weak labels.
You can do that using the concat function from pandas.
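A minimal sketch, assuming y_train and y_weak_train are pandas Series:

    import pandas as pd

    # Two copies of the features: the first carries the ground-truth
    # labels, the second the weak heuristic labels.
    X_stacked = pd.concat([X_train, X_train])
    y_stacked = pd.concat([y_train, y_weak_train])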
7. Label weights
However, this would treat ground truth the same as weak labels, which we know is not right. To account for the difference in accuracy, we can place weights on examples.
For instance, we might say that each weakly labeled example carries half the importance of a ground truth example.
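One way to encode that choice, continuing the sketch above:

    import numpy as np

    # Weight 1.0 for ground-truth examples, 0.5 for the weakly labeled copy.
    weights = np.concatenate([np.ones(len(y_train)),
                              0.5 * np.ones(len(y_weak_train))])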
8. Weak labels
Most classifiers in scikit-learn support the use of weights via the sample_weight parameter. Here you can see a quick comparison. First, you could just use ground truth, which gives you an accuracy of 0.91.
Second, you could use expert labels and weak labels without weights, so treating them as equally important. This gets you to 0.93. The heuristic is helpful!
Adding the weights makes an even bigger difference, bringing accuracy up to 0.95.
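The fitting pattern looks like this; the choice of RandomForestClassifier is an assumption, and the exact accuracies above come from the lesson's dataset.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    clf = RandomForestClassifier()

    # Weighted fit: sample_weight downweights the weakly labeled copy.
    clf.fit(X_stacked, y_stacked, sample_weight=weights)
    accuracy_score(y_test, clf.predict(X_test))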
9. Labels do not need to be perfect!
In the next exercise, we will deal with more of these inaccurate labels and try to make them more useful with weights.