1. What are anomalies and outliers?
Hi, I am Bex Tuychiev and I will be your instructor for this course on anomaly detection.
2. Inliers vs. outliers
Anomaly detection is a ubiquitous problem in data science and machine learning as the majority of real-world datasets have anomalies.
Anomaly detection involves classifying data into two categories: inliers, or what we consider normal data points, and outliers, which are rare observations that differ statistically from the rest of the data and look inconsistent with it.
3. Planet Earth as an anomaly
Life on Earth is a wonderful example of an outlier. Since no extraterrestrial life has been conclusively found in the Milky Way, we can consider all other planets in the galaxy inliers, and the planet Earth an outlier.
4. Statistical definition
Statistically speaking, an outlier is a data point whose features are abnormal and differ from the rest of the data in a statistically significant way. Note that whichever method is used to find outliers, the final determination of whether a data point is an outlier is usually up to the observer.
5. Applications of anomaly detection
Anomaly detection has applications in many industries such as cyber security—finding security leaks and attacks,
6. Applications of anomaly detection
medicine—detecting tumors or cancerous cells,
7. Applications of anomaly detection
and finance and banking—such as detecting fraud.
In use cases such as these, anomaly detection techniques can be used to investigate whether extreme values negatively affect statistical analyses and models.
8. Example data
Mean and variance are two of the most common summary statistics, so it's important to appreciate how they can be affected by outliers.
Consider this pandas Series of 10 numbers, where 1289 can be clearly identified as an outlier.
9. Affected mean and variance
With the outlier included, the data has a mean about five times greater and a variance almost 1400 times greater than the same data without the outlier.
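To make the comparison concrete, here is a minimal sketch of how it could be computed. The Series on the slide is not reproduced in this transcript, so the values below are hypothetical stand-ins; only the outlier 1289 and the length of 10 come from the example.

```python
import pandas as pd

# Hypothetical stand-in values: only the outlier 1289 and the
# length of 10 are taken from the example in the transcript.
numbers = pd.Series([24, 46, 30, 28, 1289, 25, 21, 31, 48, 47])

# Drop the suspected outlier to get the comparison data.
without_outlier = numbers[numbers != 1289]

# pandas computes the sample variance (ddof=1) by default.
mean_ratio = numbers.mean() / without_outlier.mean()
var_ratio = numbers.var() / without_outlier.var()

print(f"Mean with outlier:     {numbers.mean():.1f}")
print(f"Mean without outlier:  {without_outlier.mean():.1f}")
print(f"Mean ratio:            {mean_ratio:.1f}")
print(f"Variance ratio:        {var_ratio:.0f}")
```

With these stand-in values, the mean ratio comes out at roughly 4.8 and the variance ratio at roughly 1378. A single extreme value is enough to distort both statistics.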
10. Anomalies in training data
Anomalies can also create noise in training data. Machine learning models can treat them as a separate sub-group in the data because of their rarity and uniqueness. This can take the focus away from real patterns in the data, which can hurt model performance.
11. Outlier vs. novelty detection
Outlier detection should not be confused with novelty detection. Outlier detection finds outliers only in training data. In contrast, in novelty detection, we want to know if observations in the test set have the same distribution as the data we trained on. In other words, novelties only exist in the test set.
While both outlier and novelty detection are part of anomaly detection, this course focuses on outlier detection. Chapter one will be about univariate outliers, which exist in one-dimensional datasets or single distributions.
12. 5-number summary
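Let's load the Big Mart Sales dataset and start outlier detection by calling the dot-describe method on its sales column. Its output includes the 5-number summary (minimum, the three quartiles, and maximum) along with the count, mean, and standard deviation.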
If we look at the maximum value, it is six times higher than the mean, which raises suspicion.
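As a rough sketch of this check, the snippet below calls describe on a synthetic, right-skewed Series. The Big Mart file and its column name are not given in this transcript, so the `sales` Series and its distribution are assumptions made purely for illustration, and the ratio it prints will not match the one quoted for the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sales column (an assumption, not the real data).
rng = np.random.default_rng(seed=0)
sales = pd.Series(rng.gamma(shape=2.0, scale=1000.0, size=1000), name="sales")

# describe() returns count, mean, std, min, quartiles, and max.
summary = sales.describe()
print(summary)

# The 5-number summary is the subset of order statistics.
five_number = summary[["min", "25%", "50%", "75%", "max"]]
print(five_number)

# A maximum that is many times the mean is a hint to look closer.
max_to_mean = summary["max"] / summary["mean"]
print(f"max / mean = {max_to_mean:.1f}")
```

A large ratio alone does not prove that the maximum is an outlier; it only flags the tail as worth inspecting with a plot.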
13. Plot a histogram
Let's plot a histogram of Big Mart sales using plt-dot-hist. A rule of thumb to use for the number of bins is the square root of the number of observations.
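The square-root rule of thumb can be sketched as follows. The data here is synthetic, since the real sales column is not part of this transcript, and the `Agg` backend and the output file name are assumptions that only let the snippet run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the sales column (an assumption, not the real data).
rng = np.random.default_rng(seed=0)
sales = rng.gamma(shape=2.0, scale=1000.0, size=1000)

# Rule of thumb: number of bins = square root of the number of observations.
n_bins = int(np.sqrt(len(sales)))

counts, edges, patches = plt.hist(sales, bins=n_bins)
plt.xlabel("Sales")
plt.ylabel("Count")
plt.savefig("sales_hist.png")  # hypothetical output file name
```

In an interactive session, `plt.show()` would replace the `savefig` call.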
14. The resulting histogram
The right tail shows some bars with a height of nearly zero, far off from the bulk of the histogram. This suggests they might be outliers.
15. Plot a scatterplot
Using a scatterplot, we can build on insights from the histogram. After creating a list of consecutive integers with the same length as our distribution, we plot a scatterplot of sales versus integers.
16. The resulting scatterplot
Again, we see many suspicious points that are around 10000.
By creating a figure with low height and high width before calling the plt-dot-scatter function, for example through the figsize argument of plt-dot-figure, we are able to see the range the normal datapoints lie in and also spot the outliers as the highest dots in the plot.
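A minimal sketch of such a scatterplot is shown below. The data is again synthetic, and the figure size of 16 by 4 inches, the `Agg` backend, and the output file name are assumptions chosen for illustration rather than values taken from the course.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the sales column (an assumption, not the real data).
rng = np.random.default_rng(seed=0)
sales = rng.gamma(shape=2.0, scale=1000.0, size=1000)

# Consecutive integers with the same length as the distribution.
integers = np.arange(len(sales))

# A wide, short figure spreads the points out horizontally.
fig = plt.figure(figsize=(16, 4))
plt.scatter(integers, sales, alpha=0.5)
plt.xlabel("Observation index")
plt.ylabel("Sales")
plt.savefig("sales_scatter.png")  # hypothetical output file name
```

The x-axis carries no information here; it only separates the observations so that the dense band of normal values and the isolated high points are both visible.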
17. Let's practice
Now, let's practice!