Detecting univariate outliers

1. Detecting univariate outliers

In this lesson we will again detect anomalies, but this time using robust statistics.

2. Outliers

Fraudsters can often be detected because their behavior deviates from the norm, which makes outlier detection an important tool. Note, however, that not every outlier is a fraudulent observation, so flagged cases need follow-up and validation.

3. Outlier detection

Observations whose z-score is large in absolute value are typically considered outliers. For a specific variable, the z-score measures how many standard deviations an observation lies away from the mean. It is calculated by subtracting the mean from the observation and then dividing by the standard deviation.
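The calculation described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `z_scores` and the example values are hypothetical and not part of the lesson.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return classical z-scores: (x - mean) / sample standard deviation."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(x - center) / spread for x in values]


# Hypothetical values, chosen only to make the arithmetic easy to follow.
example = [2.0, 4.0, 6.0]
print(z_scores(example))
```

Here the mean is 4 and the sample standard deviation is 2, so the three observations lie one standard deviation below, exactly at, and one standard deviation above the mean.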

4. Example

We look at a small dataset containing the log-transformed income of 10 persons. The last person earns a lot more than the others. Let us calculate the z-score of each observation. All z-scores are smaller than 3 in absolute value, so the last observation is unfortunately not flagged as an outlier. Its z-score is too small because it relies on the classical estimators, the sample mean and the sample standard deviation, which are themselves very sensitive to outliers: the sample mean is pulled towards the outlier and the standard deviation is inflated by it. Robust statistics offers a more reliable tool for anomaly detection.
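The masking effect can be reproduced with a small sketch. The lesson's actual data are not shown in this transcript, so the ten log incomes below are hypothetical values invented for illustration; only the pattern (nine similar values and one much larger one) follows the description above.

```python
from statistics import mean, stdev

# Hypothetical log incomes for 10 persons (not the lesson's data);
# the last value is deliberately much larger than the rest.
log_income = [10.2, 10.4, 10.1, 10.3, 10.5, 10.2, 10.4, 10.3, 10.1, 13.8]

center = mean(log_income)   # pulled upwards by the last value
spread = stdev(log_income)  # inflated by the last value
z = [(x - center) / spread for x in log_income]

# Flag observations whose z-score exceeds the cut-off 3 in absolute value.
flagged = [i for i, score in enumerate(z) if abs(score) > 3]

print([round(score, 2) for score in z])
print("flagged indices:", flagged)
```

No index is flagged. With only 10 observations this is unavoidable: a classical z-score based on the sample standard deviation can never exceed (n - 1) / sqrt(n) in absolute value, which is about 2.85 for n = 10, so the cut-off 3 is out of reach however extreme the outlier is.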

5. Robust statistics

Robust statistics aims at deriving methods that also produce reliable results when there are outliers in the data. Robust methods fit the majority of the data well: if the data contain no outliers they give approximately the same results as the classical method, while if the data are contaminated by outliers they give approximately the same results as the classical method applied to the outlier-free data. As a consequence, they provide a very reliable method of detecting outliers.

6. Estimators of location: mean & median

The mean is very sensitive to outliers: replacing even a single observation with a very large value can change it completely. The median, on the other hand, can resist up to almost 50% of outliers. If we delete the outlier, the classical and robust estimates are almost identical. The robustness of the median comes at a price: under a Gaussian model it is less efficient than the mean.
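A short sketch makes the contrast concrete. The values are hypothetical; the second list replaces a single observation of the first with a very large value.

```python
from statistics import mean, median

# Hypothetical clean sample (illustration only).
clean = [10.2, 10.4, 10.1, 10.3, 10.5, 10.2, 10.4, 10.3, 10.1, 10.3]

# Replace the last observation with a very large value.
contaminated = clean[:-1] + [1000.0]

print("clean:        mean =", round(mean(clean), 2), " median =", median(clean))
print("contaminated: mean =", round(mean(contaminated), 2),
      " median =", median(contaminated))
```

The mean jumps from roughly 10 to more than 100, while the median does not move at all in this example.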

7. Estimators of scale: sd

The sample standard deviation is very sensitive to outliers.

8. Estimators of scale: mad, IQR

Frequently used robust alternatives are the median absolute deviation and the interquartile range.
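Both estimators can be sketched with the standard library. Two choices below are assumptions rather than something the lesson specifies: the constant 1.4826, which is commonly used to make the median absolute deviation comparable to the standard deviation for normal data, and the "inclusive" quartile convention used for the interquartile range. The function names `mad` and `iqr` are illustrative.

```python
from statistics import median, quantiles, stdev


def mad(values, constant=1.4826):
    """Median absolute deviation, scaled by an assumed consistency constant."""
    center = median(values)
    return constant * median(abs(x - center) for x in values)


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 100]
print("sd  =", round(stdev(data), 2))  # dominated by the extreme value
print("mad =", mad(data))
print("iqr =", iqr(data))
```

The standard deviation is driven almost entirely by the value 100, whereas the median absolute deviation and the interquartile range still describe the spread of the four ordinary values.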

9. Robust z-scores for outlier detection

By plugging in the robust estimators of location and scale, we obtain robust z-scores. Using robust z-scores and a cut-off value of 3, the outlier is now correctly flagged.
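As a sketch of this idea, the snippet below uses the median as the location estimator and the median absolute deviation as the scale estimator. Those two choices, the constant 1.4826, and the data are assumptions made for illustration; the transcript does not show the lesson's own data or code.

```python
from statistics import median


def robust_z_scores(values, constant=1.4826):
    """Robust z-scores: (x - median) / MAD.

    Assumes the MAD is non-zero, i.e. fewer than half of the values are
    identical to the median.
    """
    center = median(values)
    scale = constant * median(abs(x - center) for x in values)
    return [(x - center) / scale for x in values]


# Hypothetical log incomes for 10 persons (not the lesson's data).
log_income = [10.2, 10.4, 10.1, 10.3, 10.5, 10.2, 10.4, 10.3, 10.1, 13.8]

rz = robust_z_scores(log_income)
flagged = [i for i, score in enumerate(rz) if abs(score) > 3]

print([round(score, 2) for score in rz])
print("flagged indices:", flagged)
```

Because the median and the median absolute deviation are barely affected by the last value, its robust z-score is far above 3 and it is the only observation flagged.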

10. Boxplot

A very popular graphical technique for analyzing a univariate dataset is Tukey's boxplot. In this plot a box is drawn from the first to the third quartile, so the length of the box equals the interquartile range. The fences are placed 1.5 times the interquartile range below the first quartile and above the third quartile. Every observation outside the boxplot fences is then marked on the plot as an outlier.
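The fence rule can be sketched numerically without drawing the plot. The snippet assumes the usual multiplier of 1.5 times the interquartile range and the "inclusive" quartile convention; plotting software may compute quartiles slightly differently, so fences can differ a little between tools. The function name and data are hypothetical.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) boxplot fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1  # the interquartile range
    return q1 - k * box_length, q3 + k * box_length


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 100]
lower, upper = tukey_fences(data)
outliers = [x for x in data if x < lower or x > upper]

print("fences:", lower, upper)
print("outliers:", outliers)
```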

11. Example: length of stay (LOS) in hospital

This dataset contains the length of stay for 201 patients at the University Hospital of Lausanne during the year 2000 and may be used to predict the total resource consumption. By drawing a boxplot, we see that many patients stayed longer than expected. We can also identify the outliers. Is this an indication of insurance fraud?

12. Adjusted boxplot

If we classify all points outside the fence as outliers, then we implicitly assume that the data are normally distributed. In practice, data are often not symmetrically distributed. The skewness-adjusted boxplot of Hubert and Vandervieren therefore modifies the fences. For these plots we generated data from a very skewed distribution to illustrate the effect.
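To give a feel for how the fences are modified, here is a rough sketch. It assumes the commonly cited form of the adjustment, in which a robust skewness measure called the medcouple (MC) rescales the two fences through exponential factors with constants -4 and 3 for right-skewed data (mirrored for left-skewed data). The medcouple below is a naive quadratic-time version meant for small samples, and all names and data are hypothetical.

```python
from math import exp
from statistics import median, quantiles


def medcouple(values):
    """Naive O(n^2) medcouple: median of a kernel over all pairs
    (x_i, x_j) with x_i <= median <= x_j."""
    xs = sorted(values)
    med = median(xs)
    lower = [x for x in xs if x <= med]
    upper = [x for x in xs if x >= med]
    k = sum(1 for x in xs if x == med)  # observations tied with the median
    first_tie = len(lower) - k          # position of the first tie in `lower`
    kernel = []
    for j, xj in enumerate(upper):
        for i, xi in enumerate(lower):
            if xi == xj:
                # Both values equal the median: sign-based kernel for ties.
                rank_sum = (i - first_tie + 1) + (j + 1) - 1
                kernel.append((rank_sum > k) - (rank_sum < k))
            else:
                kernel.append(((xj - med) - (med - xi)) / (xj - xi))
    return median(kernel)


def adjusted_fences(values):
    """Skewness-adjusted fences; reduce to Tukey's fences when MC is 0."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1
    mc = medcouple(values)
    if mc >= 0:
        return (q1 - 1.5 * exp(-4 * mc) * box_length,
                q3 + 1.5 * exp(3 * mc) * box_length)
    return (q1 - 1.5 * exp(-3 * mc) * box_length,
            q3 + 1.5 * exp(4 * mc) * box_length)


# Hypothetical right-skewed sample.
skewed = [1, 2, 3, 6, 12]
print("medcouple:", medcouple(skewed))
print("adjusted fences:", adjusted_fences(skewed))
```

For right-skewed data the medcouple is positive, so the upper fence moves further out and the lower fence moves in. Fewer points in the long right tail are then flagged than with the standard boxplot.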

13. Example LOS: adjusted boxplot

When applying the adjusted boxplot, we see that only three observations are flagged as outlying. It might be interesting to inspect these atypical points.

14. Example LOS: boxplot vs adjusted boxplot

We can also compare the plots.

15. Let's practice!

Now it is your turn to detect outliers using robust z-scores and the boxplot!
