1. A deep dive into window calculations
In this demo, we'll revisit the Tableau Superstore dataset one last time to explore rolling standard deviations, setting upper and lower bounds, and using Z-scores to classify anomalies!
First, let's drag our sales and quantity fields onto a new canvas, where we'll set the granularity to weekly.
Previously, we identified periods where sales increased suddenly. Any business will want to understand these peaks and ask whether there were any indicators that led up to them. We can use a rolling standard deviation to help us here.
Let's first modify our sales into a window calculation using the window_stdev function; we'll call it Window_Stdev_Sum.
We want a 4-week window, so we just need the arguments -3 and 0, since the granularity in our view is already set to weeks. It's worth noting that the larger your window size, the smoother your data becomes, so make sure your window isn't too large, or you'll lose the peaks in your data.
We'll now repeat this process for the quantity measure and rename it Window_stdev_quantity; both calculations are sketched below.
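As a rough sketch, assuming the standard Superstore field names Sales and Quantity, the two calculated fields might look like this:

// Window_Stdev_Sum: rolling standard deviation of weekly sales over a 4-week window (current week plus the 3 before it)
WINDOW_STDEV(SUM([Sales]), -3, 0)

// Window_stdev_quantity: the same rolling calculation applied to quantity
WINDOW_STDEV(SUM([Quantity]), -3, 0)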
Now let’s clear our canvas and drag these two measures in instead.
The volatility is quite clear, but we want to understand how it relates to the different business segments, so let's drag segment onto Color for both of our window calculations and see what this tells us.
Take a moment to look at the visualization and study it carefully.
Of the three segments, the consumer segment is clearly the most volatile, with the majority of the peaks, whereas corporate and home office, apart from a few obvious peaks, show more stability in sales and quantity transactions.
However, this isn’t enough.
As data professionals, we need to be concrete about what the upper and lower limits for sales actually are.
So let's set some upper and lower limits.
We'll create a calculated field called upper limit that uses the window average and window stdev functions.
Note that because I haven't explicitly set a window size, the calculation uses the whole range of sales values, which is exactly what we want when establishing upper and lower limits.
We'll then set a lower limit following the same process, except we subtract 3 standard deviations instead of adding them.
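As a sketch, assuming we keep 3 standard deviations around the mean, the two calculated fields could be written as:

// Upper limit: window average of weekly sales plus 3 window standard deviations
// (no window size specified, so the whole range of values is used)
WINDOW_AVG(SUM([Sales])) + 3 * WINDOW_STDEV(SUM([Sales]))

// Lower limit: window average of weekly sales minus 3 window standard deviations
WINDOW_AVG(SUM([Sales])) - 3 * WINDOW_STDEV(SUM([Sales]))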
Let’s create a new canvas and drag in our date and sales, before setting the upper and lower limits as our dual axis.
After synchronizing the axes, we can clearly see our thresholds and cut-off points.
Now that we’ve set our upper and lower bounds - is there any other way we can identify anomalous values?
Lastly, we’ll cover Z-scores.
We know that any value with an absolute Z-score greater than 3 is typically considered anomalous. However, in some contexts, we can use a Z-score threshold of 2, which roughly corresponds to the central 95% of a normal distribution.
Let’s build out a calculated field called Z-score that calculates this for quantities.
The Z-score is the observed value minus the average value, divided by the standard deviation for the set of values.
Note that since we're interested in identifying abnormal weekly quantities, we aggregate our quantity measure using sum. You should always think through the granularity of your view and whether or not your aggregation makes sense. A sketch of the calculation is shown below.
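A sketch of the Z-score calculated field, following the formula above:

// Z-score: how many standard deviations each weekly quantity sits from the average
(SUM([Quantity]) - WINDOW_AVG(SUM([Quantity]))) / WINDOW_STDEV(SUM([Quantity]))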
Let’s create one last canvas.
We’ll drag in our quantity and order date, set to a weekly granularity.
Now we'll drag our Z-score onto our quantity line chart, change it to a bar chart, and we can see where the Z-score is spiking, which coincides with our anomalous periods!
You'll notice that the Z-score values range from roughly -1.5 to 4. This is where you can work with your stakeholders to decide whether those anomalies, that is, any value with a Z-score greater than 3, should be removed or retained in the dataset for analysis purposes!
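If you and your stakeholders do settle on a cut-off, one way to label those weeks is a simple calculated field; the name Anomaly Flag and the threshold of 3 here are just illustrative:

// Anomaly Flag: label weeks whose absolute Z-score exceeds 3
IF ABS([Z-score]) > 3 THEN 'Anomaly' ELSE 'Normal' END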
2. Let's practice!
Now - it’s your turn to put this into practice!