
Using z-scores for Anomaly Detection

1. Using z-scores for Anomaly Detection

In this video, we will take a look at two more univariate anomaly detection methods: z-scores and Median Absolute Deviation.

2. What are z-scores?

The z-score of a sample drawn from a normal distribution tells us how many standard deviations the sample lies from the mean. For example, in a distribution with a mean of 10 and a standard deviation of 3, the sample 16.3 has a z-score of 2.1, because (16.3 - 10) / 3 = 2.1.
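As a quick check of that arithmetic, here is a minimal sketch in plain Python. The helper name `z_score` is made up for this illustration and is not part of any library.

```python
def z_score(x: float, mean: float, std: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std


# Example from the text: mean 10, standard deviation 3, sample 16.3
z = z_score(16.3, mean=10, std=3)
print(round(z, 2))
```

A positive result means the sample sits above the mean, and a negative one means it sits below.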

3. The Empirical Rule and outliers

The idea of using z-scores for outlier detection comes from the Empirical Rule. The rule helps us remember the percentage of values that lie within one, two, and three standard deviations of the mean: roughly 68, 95, and 99.7 percent, respectively. Values outside the three-standard-deviation range are considered extreme because they fall in the far left and right tails (the pink areas) of the normal curve. Based on this, it is common practice to use an absolute z-score of three as a threshold for filtering outliers in normally distributed data. Let's see how that works in code.
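Before moving on to z-scores themselves, the Empirical Rule percentages can be checked with Python's standard library. This is only a sketch: the helper name `coverage` is made up for this illustration, and `NormalDist` comes from the built-in `statistics` module.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def coverage(k: float) -> float:
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(k):.1%}")
```

Whatever falls outside the three-standard-deviation band, about 0.3 percent of a normal distribution, is what the threshold of three treats as an outlier.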

4. Z-scores in code

In Python, we can use the zscore function from scipy.stats. It takes an array of data and returns a z-score for every value in the dataset.

5. Z-scores in code

Next, we create a boolean mask that checks if the absolute values of the z-scores are above three. The output is a boolean series, with the first five rows as the preview.

6. Z-scores in code

Finally, we use this mask to subset sales for outliers. We find 90 outliers using z-scores.
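The three steps above can be sketched as follows. The slide code is not part of this transcript, so the `sales` array here is a small made-up stand-in rather than the course dataset, and the variable names are assumptions. It will not reproduce the count of 90.

```python
import numpy as np
from scipy.stats import zscore

# Hypothetical stand-in for the sales column: 99 ordinary values and one extreme one
sales = np.array([9.0, 10.0, 11.0] * 33 + [100.0])

# Step 1: one z-score per value (zscore uses the population standard deviation by default)
scores = zscore(sales)

# Step 2: boolean mask that is True where the absolute z-score exceeds three
is_outlier = np.abs(scores) > 3

# Step 3: use the mask to subset the data for outliers
outliers = sales[is_outlier]
print(len(outliers), outliers)
```

Taking the absolute value in step 2 matters because outliers can sit in either tail of the distribution.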

7. Drawbacks of z-scores

The z-score method is powerful, but it's important to understand its limitations. First, it is only appropriate when the data comes from a normal distribution. Second, z-scores rely on the very metrics that outliers influence the most: the mean and standard deviation. Their effectiveness suffers greatly if there are too many outliers in the data, because those outliers skew the mean and inflate the standard deviation.
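The second drawback is easy to see on a toy example. In this sketch, the data is made up for illustration: three extreme values pull the mean up and inflate the standard deviation so much that none of them crosses the threshold of three.

```python
from statistics import mean, pstdev

# Made-up data: 18 ordinary values plus three extreme ones
data = [9, 10, 11] * 6 + [60, 65, 70]

mu = mean(data)
sigma = pstdev(data)  # population standard deviation

z_scores = [(x - mu) / sigma for x in data]
flagged = [x for x, z in zip(data, z_scores) if abs(z) > 3]

print(f"mean={mu:.2f}, std={sigma:.2f}, flagged={flagged}")
```

Even though 60, 65, and 70 are far from the bulk of the data, the list of flagged values comes back empty, because the outliers themselves distort the statistics used to detect them.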

8. Median Absolute Deviation (MAD)

An alternative is the modified z-scores algorithm. Under the hood, it uses a score called Median Absolute Deviation (MAD). Like standard deviation, MAD is a measure of dispersion but is much more outlier resistant as it uses the median at its core. Let's calculate the score in scipy.

9. MAD score

We import the median_abs_deviation function from the scipy.stats module and call it on the sales distribution. Once we have this MAD value, we can use it to replace the standard deviation in the z-score formula. Instead of asking, "How many standard deviations away from the mean?", we ask, "How many median absolute deviations, or MADs, away from the median?"
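A minimal sketch of that substitution, using a made-up `sales` array rather than the course data. The 0.6745 scaling factor and the cutoff of 3.5 are the conventional choices for modified z-scores. The text above does not state the factor, so treat it as an assumption here.

```python
import numpy as np
from scipy.stats import median_abs_deviation

# Made-up data standing in for the sales column
sales = np.array([9.0, 10.0, 11.0] * 6 + [60.0, 65.0, 70.0])

median = np.median(sales)
mad = median_abs_deviation(sales)  # median of |x - median|, unscaled by default

# Modified z-score: distance from the median measured in MADs.
# The 0.6745 factor (an assumption, see above) makes MAD comparable to the
# standard deviation when the data is normally distributed.
modified_z = 0.6745 * (sales - median) / mad

outliers = sales[np.abs(modified_z) > 3.5]
print(mad, outliers)
```

Because the median and MAD barely move when a few extreme values are added, the three large values stand out clearly here.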

10. Introduction to PyOD

To insert the MAD score into the modified z-scores algorithm, we use the Python Outlier Detection (pyod) package, a popular anomaly detection library in the Python ecosystem that includes more than 40 algorithms implemented with sklearn-like syntax. We will use pyod extensively, starting with the MAD estimator.

11. Modified z-scores in code

We import the MAD estimator from pyod.models.mad and initialize it with a threshold of 3.5, which is both the default and the recommended value. The estimator calculates the MAD score under the hood and will mark any points beyond 3.5 MAD scores as outliers. Next, we convert the sales column into a NumPy array and reshape it into a 2D array, which all pyod models require, just as scikit-learn estimators do.

12. Modified z-scores in code

Then, we use mad's fit_predict method to generate inlier and outlier labels for sales_reshaped. The MAD estimator returns 0 for inliers and 1 for outliers. We find 83 rather than 90 outliers using modified z-scores and can trust that this result is more robust.

13. Let's practice!

Now, let's practice!