1. Anomaly detection with window functions
In this final chapter, we'll revisit standard deviation before diving deeper into rolling standard deviation, upper and lower control limits, and how Z-score sensitivity helps us with anomaly detection. Let's get started.
2. Standard deviation versus rolling standard deviation
As we've previously covered, standard deviation is a measure of dispersion: how far the values in a population typically sit from their mean.
High dispersion means high variance, and low dispersion means low variance. This can be seen from the box plot, where the wider interquartile range represents high variance and the narrower interquartile range represents low variance.
Rolling standard deviation is the standard deviation computed over a specified time window that moves along the series, which makes it a technique for identifying variance inflation.
In other words, if the variance increases as the time window shifts, this might signal an anomalous period for further investigation.
We can see this in our line chart where the rolling standard deviation is smooth up until what appears to be a highly anomalous event at the end of the time series.
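To make the idea concrete, here is a minimal Python sketch of a rolling standard deviation. The `rolling_std` helper, the window size of 4 and the `readings` values are all made up for illustration, and the choice of population (rather than sample) standard deviation is an assumption.

```python
from statistics import pstdev

def rolling_std(values, window):
    """Population standard deviation over each trailing window of `window` points."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history to fill the window yet
        else:
            out.append(pstdev(values[i + 1 - window : i + 1]))
    return out

# Hypothetical series: stable readings followed by a spike at the end.
readings = [10, 11, 10, 11, 10, 11, 10, 30]
rolling = rolling_std(readings, window=4)

for value, spread in zip(readings, rolling):
    print(value, spread)
```

The rolling value stays flat while the readings are stable and jumps once the spike enters the window, which is the pattern the line chart describes.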
3. Standard deviation and anomaly detection
Assuming your data is normally distributed, we can apply the 68-95-99.7 rule.
We can see this in our visual to the right, where the pink area covers values within 1 standard deviation of the mean, which contains about 68% of all values. The purple area extends to 2 standard deviations, which contains about 95% of all values.
Lastly, the green area extends to 3 standard deviations, which contains about 99.7% of all the values.
Any value outside of this range is treated as an anomaly.
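If you'd like to check these percentages yourself, one way is to ask Python's standard library for the share of a normal distribution that lies within k standard deviations of the mean. This is only a sketch, and the `coverage` helper name is made up for illustration.

```python
from statistics import NormalDist

def coverage(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(k):.4f}")
```

The three shares come out at roughly 68%, 95% and 99.7%. Keep in mind this only holds when the data really is close to normally distributed.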
4. Upper and lower control limits
Now, is there a formal way of applying these 3 standard deviation limits against our data?
Yes - these are known as upper and lower control limits.
They're particularly useful for analyzing univariate time series.
A single pair of limits isn't applicable to multivariate analysis, as each univariate signal will have its own range of acceptable values.
The limits are typically drawn on visuals known as control charts, which provide an effective way to visualize the upper and lower limits for the respective time series we're analyzing.
Formally, the upper limit is the population mean plus 3 standard deviations, and any value greater than it is flagged. Similarly, the lower limit is the population mean minus 3 standard deviations, and any value below it is flagged.
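As a rough sketch, the two limits can be computed in Python as the mean plus and minus 3 standard deviations, with anything beyond either limit flagged. The `control_limits` helper and the `series` values are hypothetical, chosen only to show the mechanics.

```python
from statistics import fmean, pstdev

def control_limits(values, k=3):
    """Return (lower, upper) limits: mean minus/plus k population standard deviations."""
    mu = fmean(values)
    sigma = pstdev(values)
    return mu - k * sigma, mu + k * sigma

# Hypothetical univariate series: steady readings with one spike at the end.
series = [10, 11] * 10 + [30]
lower, upper = control_limits(series)

# Flag anything that falls outside the two limits.
flagged = [x for x in series if x < lower or x > upper]
print(lower, upper, flagged)
```

Note that the spike itself inflates the standard deviation and so widens the limits. Computing the limits from a known-good baseline period instead is a common alternative, though that is a design choice rather than something this chapter prescribes.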
5. What are Z-scores?
So we've formally got upper and lower limits that help us identify a cut-off threshold.
However, we can also make use of Z-scores to formally identify anomalies.
Formally, the Z-score is the number of standard deviations a given data point lies above or below the mean.
A positive Z-score indicates the value is above the mean whereas a negative Z-score indicates the value is below the mean.
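Lastly, the Z-score is separate from the standard deviation: the standard deviation measures how spread out the data points are overall, whereas the Z-score locates one data point relative to the mean.
As we can see from the image, the further a value falls below the population mean, the more negative its Z-score, and the further above the mean, the more positive.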
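Written out, the Z-score of a value is (value - mean) / standard deviation. A minimal Python sketch, assuming population statistics and using a made-up `sample` whose mean is 5 and whose population standard deviation is 2:

```python
from statistics import fmean, pstdev

def z_scores(values):
    """Z-score of each value: (value - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical, so Z-scores are undefined")
    return [(x - mu) / sigma for x in values]

# Hypothetical sample with mean 5 and population standard deviation 2.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(sample)

for value, z in zip(sample, scores):
    print(value, round(z, 2))
```

Values below the mean, such as 2, get a negative Z-score, and values above it, such as 9, get a positive one.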
6. Z-scores and anomaly detection
Now, whilst Z-scores of three and above in absolute value are commonly considered anomalous, this is subject to user context.
If we look at our first Z-score image, we can see how a Z-score threshold of one has flagged a number of anomalies.
However, when this is then increased to 2, we can see fewer anomalous values are flagged.
This isn't a bad thing, but it depends upon how sensitive you want your anomaly detection to be.
If you set it too high, you'll miss too many anomalies; if you set it too low, you'll have too many to analyze.
Just remember, this sensitivity is contextual and depends on your use case, whether you use the standard deviation approach or the Z-score approach.
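To see the sensitivity trade-off in numbers, here is a small Python sketch that counts how many points are flagged as the Z-score threshold rises from 1 to 3. The `flag_anomalies` helper and the `data` values are hypothetical, and comparing the absolute Z-score against the threshold is an assumption about how the cut-off is applied.

```python
from statistics import fmean, pstdev

def flag_anomalies(values, threshold):
    """Return the values whose absolute Z-score exceeds `threshold`."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [x for x in values if abs(x - mu) / sigma > threshold]

# Hypothetical series: mostly steady, with three increasingly extreme values.
data = [10, 12, 9, 11, 10, 17, 8, 11, 10, 12,
        9, 11, 10, 11, 10, 12, 9, 11, 21, 26]

# Count the flagged points at each sensitivity level.
counts = {t: len(flag_anomalies(data, t)) for t in (1, 2, 3)}
print(counts)
```

On this data the lowest threshold flags the most points and the highest flags the fewest, which mirrors what the two Z-score images show.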
7. Let's practice!
Now that we've covered our final concepts of standard deviation, upper and lower control limits, and the Z-score, it's time for you to put this into practice!