Labeled anomalies

1. Labeled anomalies

Sometimes a dataset includes examples that are known anomalies from past events. These are often very rare, but they can be used to assess the performance of an anomaly detection algorithm. In this video, we will visually compare an anomaly score against a set of known anomaly labels.

2. Satellite image data

In this chapter, we'll explore a new dataset called sat, in which each row corresponds to one of 5803 satellite images. The first 5 rows are printed here using the head function. The first column is called label and contains binary values indicating the land use type in each image. If the land use is a cotton crop the label is 1, and 0 otherwise. The remaining 5 columns describe the pixel brightness in sections of the image.
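To make the layout concrete, here is a minimal sketch using a simulated stand-in for sat, since the real data is not reproduced in this transcript. The name sat_demo, the simulated values, and the column names V1 to V5 are assumptions for illustration; the transcript only confirms label, V2 and V3.

```r
# Hypothetical stand-in for the sat data: the values are simulated, and the
# column names V1..V5 are assumed (the transcript names only label, V2, V3).
set.seed(1)
n <- 200
sat_demo <- data.frame(
  label = rbinom(n, size = 1, prob = 0.05),  # 1 = cotton crop, 0 = otherwise
  V1 = rnorm(n), V2 = rnorm(n), V3 = rnorm(n), V4 = rnorm(n), V5 = rnorm(n)
)

# Print the first 5 rows, as in the video
first_rows <- head(sat_demo, 5)
print(first_rows)

# Check the layout: one label column followed by 5 brightness columns
str(sat_demo)
```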

3. Satellite image data

The table function has been used here to tally the total number of 0s and 1s occurring in the label column. Notice that there are only 71 images containing cotton crop, which is only 1.2 percent of the images, found by dividing 71 by 5803. Cotton crop images occur very rarely. Our job is to find an anomaly score based on pixel attributes that could be used to find these 71 points as accurately as possible.
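The tally and the percentage can be reproduced from the two counts quoted in the video. This sketch rebuilds only a label vector from those counts (71 and 5803); the variable names are made up for illustration, and the real sat data is not used.

```r
# Rebuild only the label column from the counts quoted in the video:
# 71 cotton crop images out of 5803 in total.
n_total <- 5803
n_cotton <- 71
label <- c(rep(1, n_cotton), rep(0, n_total - n_cotton))

# Tally the 0s and 1s
label_counts <- table(label)
print(label_counts)

# Share of each class as a proportion
print(prop.table(label_counts))

# Percentage of cotton crop images, rounded to one decimal place
pct_cotton <- round(100 * mean(label == 1), 1)
print(pct_cotton)  # 1.2
```

With the real data, the same tally would come from calling table on the label column of sat.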

4. Visualize true anomalies

It's helpful to first visualize the distribution of labels against some of the pixel attributes. The plot function has been used here to generate a scatterplot of the sat columns V2 and V3. Notice that by setting the col argument equal to as.factor(label), the color of the points corresponds to the value in the label column. In this case, a label value of 1 is shown as a red point and 0 as black. Can you see from the plot that many of the red points lie clear of the others?
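A sketch of this kind of plot, again on simulated data rather than the real sat images. The data frame sat_demo and its values are assumptions, constructed so that the labeled points sit away from the main cluster.

```r
# Simulated stand-in for sat (values are not the real data): normal
# points cluster near the origin, labeled anomalies are shifted away.
set.seed(42)
n_normal <- 300
n_anom <- 10
sat_demo <- data.frame(
  label = c(rep(0, n_normal), rep(1, n_anom)),
  V2 = c(rnorm(n_normal), rnorm(n_anom, mean = 4)),
  V3 = c(rnorm(n_normal), rnorm(n_anom, mean = 4))
)

# as.factor() turns 0/1 into factor codes 1/2, which plot() maps to the
# first two palette colours (black, then a shade of red by default).
point_col <- as.factor(sat_demo$label)
plot(sat_demo$V2, sat_demo$V3, col = point_col, xlab = "V2", ylab = "V3")
```

Passing the raw 0/1 values to col would not work as well, because colour 0 is drawn in the background colour; converting to a factor shifts the codes to 1 and 2.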

5. Anomaly score versus true label

In this example, an isolation forest with 100 trees has been fitted to the satellite data, and a score column resulting from the predict function is appended to the sat dataframe. Notice that the anomaly label in the first column has been dropped from the input to the isolation forest algorithm. In the next step, a boxplot shows the anomaly score separated by the true label. Notice that the boxplot function accepts a formula argument, where the variable on the right of the tilde symbol is a category or a label. An ideal anomaly score would result in a boxplot where all true anomalies have higher scores and all other points have lower scores. The boxplot shown is typical: a few true anomalies have comparatively low scores and some normal points have higher scores.
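The workflow of dropping the label, appending a score, and drawing a boxplot by label can be sketched in base R. Note the assumptions: the data is simulated, and the score is a squared Mahalanobis distance used as a stand-in so the example runs without extra packages. It is not the isolation forest from the video.

```r
# Simulated stand-in for sat; values are not the real data.
set.seed(7)
n_normal <- 500
n_anom <- 12
sat_demo <- data.frame(
  label = c(rep(0, n_normal), rep(1, n_anom)),
  V1 = c(rnorm(n_normal), rnorm(n_anom, mean = 3)),
  V2 = c(rnorm(n_normal), rnorm(n_anom, mean = 3))
)

# Drop the label (first column) so the score never sees the truth.
features <- as.matrix(sat_demo[, -1])

# Stand-in score: squared Mahalanobis distance from the feature means.
# This replaces the isolation forest purely to keep the sketch in base R.
sat_demo$score <- mahalanobis(features, colMeans(features), cov(features))

# Formula interface: score on the left, grouping label on the right of ~
boxplot(score ~ label, data = sat_demo)

# Median score per true label, as a numeric companion to the boxplot
score_medians <- tapply(sat_demo$score, sat_demo$label, median)
print(score_medians)
```

Any score column can be swapped in on the left of the tilde, so the same boxplot call serves to compare different anomaly scores against the true labels.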

6. Why not use models to predict labels?

Isolation forest and LOF quantify how unusual individual data examples are without knowing the true anomaly status. When labeled anomalies are available, it's natural to consider training a model to predict them. This is a perfectly reasonable thing to do, but it comes with certain challenges. In the example of disease detection, the disease prevalence may be so low that, without very large data sets, it's difficult to adequately predict the status of a new individual. In the exercises that follow, you'll investigate whether cases of thyroid disease can be detected from anomalous hormone measurements. A second example is that of credit card fraud. Fraudsters are dynamic, which means that they quickly adapt as exploits are detected and blocked. Consequently, models trained to predict fraud using historic data may not perform well in the future.

7. Let's practice!

Let's put this into practice!