Detecting multivariate outliers

1. Detecting multivariate outliers

Let's conclude this course by learning to detect multivariate outliers. Multivariate outliers are cases with an unusual combination of scores on different variables.

2. Animals data

We consider a bivariate dataset containing the body and brain weights of 28 animal species. A logarithmic transformation is applied to both variables.
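
As a minimal sketch, assuming the data are the Animals dataset from the MASS package (the variable name X is ours, not from the course):

    # Animals: body weight (kg) and brain weight (g) for 28 species
    library(MASS)
    data(Animals)
    X <- log(Animals)   # log-transform both variables
    head(X)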

3. Animals data: univariate outlier detection

The boxplots indicate that there are no univariate outliers, but outliers in the multivariate setting may still be present.
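
Continuing the sketch above, a one-line univariate check:

    # one box per variable; no points are flagged as univariate outliers
    boxplot(X)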

4. Animals data: scatterplot

We create a scatterplot of the logarithms of body and brain weight, which shows that there are actually three multivariate outliers. These points are not outlying in either variable individually. We can only detect such outliers by correctly estimating the covariance structure.
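
Continuing the sketch:

    # bivariate view: the outlying species only show up when both
    # variables are considered together
    plot(X, xlab = "log(body weight)", ylab = "log(brain weight)")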

5. Mahalanobis distance

Remember that a z-score measures how many standard deviations an observation lies from the mean. The Mahalanobis distance is a multi-dimensional generalization of this idea that takes the covariance matrix into account. It measures the distance of an observation from the center, divided by the width of the ellipsoid in the direction of that observation. It therefore tells us how far the observation is from the center of the cloud, relative to the size of the cloud. If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.
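
In R, classical Mahalanobis distances can be computed with the built-in mahalanobis() function; note that it returns squared distances, hence the sqrt(). The names md and euclid are ours:

    # MD(x) = sqrt( (x - mu)' S^{-1} (x - mu) )
    md <- sqrt(mahalanobis(X, center = colMeans(X), cov = cov(X)))

    # with the identity covariance matrix it reduces to Euclidean distance
    euclid <- sqrt(mahalanobis(X, center = colMeans(X), cov = diag(2)))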

6. Mahalanobis distance to detect multivariate outliers

Classical Mahalanobis distances are obtained by taking the sample mean as the estimate of location and the sample covariance matrix as the estimate of scatter. The square root of the 97.5% quantile of the chi-square distribution with p degrees of freedom is typically used as the cut-off value. We can then create a tolerance ellipsoid containing the observations with a Mahalanobis distance smaller than the cut-off value. Since we expect about 97.5% of the observations to lie inside this ellipsoid, we can flag an observation as an outlier if it falls outside the classical tolerance ellipsoid.
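
A sketch of the cut-off and flagging step (the name cutoff is ours):

    # square root of the 97.5% chi-square quantile; p = 2 variables here
    cutoff <- sqrt(qchisq(0.975, df = ncol(X)))
    which(md > cutoff)   # flag observations outside the tolerance ellipsoid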

7. Tolerance ellipsoid based on Mahalanobis distance

Using the sample mean, the sample covariance matrix, and the cut-off value as radius, we construct a tolerance ellipsoid based on the Mahalanobis distance.
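
The course's exact plotting code is not reproduced here; as one way to draw such an ellipse by hand, points on the unit circle can be mapped onto the ellipse through the Cholesky factor of the covariance matrix. The helper tol_ellipse() is our own name:

    # 97.5% tolerance ellipse for bivariate data (df = 2)
    tol_ellipse <- function(center, cov, p = 0.975, n = 200) {
      r <- sqrt(qchisq(p, df = 2))            # cut-off value as radius
      theta <- seq(0, 2 * pi, length.out = n)
      circle <- cbind(cos(theta), sin(theta))
      # map the unit circle onto the ellipse, then shift to the center
      sweep(r * circle %*% chol(cov), 2, center, "+")
    }

    ell <- tol_ellipse(colMeans(X), cov(X))
    plot(X, xlab = "log(body weight)", ylab = "log(brain weight)")
    lines(ell)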

8. Animals data: tolerance ellipsoid based on Mahalanobis distance

We see that the ellipsoid is inflated in the direction of the three outliers, so the outliers are no longer detected. This is why it is again important to plug in robust estimators.

9. Robust estimates of location and scatter

The Minimum Covariance Determinant, or MCD, is a robust estimator of multivariate location and scatter that looks for the h observations whose sample covariance matrix has the lowest possible determinant; h is typically chosen equal to 0.75n or 0.5n. The robust estimate of location is then the mean of these h observations, and the robust estimate of scatter is their sample covariance matrix. A reweighting step is typically applied to increase efficiency without sacrificing robustness. Computing the exact MCD is non-trivial, as it requires an exhaustive search over all h-subsets out of n observations. Fortunately, much faster approximate algorithms have been developed and are available in R.
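
A sketch using covMcd() from the robustbase package, where alpha controls the subset size (h is roughly alpha times n):

    library(robustbase)
    mcd <- covMcd(X, alpha = 0.75)
    mcd$center   # robust location estimate
    mcd$cov      # robust (reweighted) scatter estimate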

10. Robust distance

Using the robustbase package, we can apply the MCD to the data and obtain the robust estimates of location (center) and scatter (cov). Plugging these robust estimates into the definition of the Mahalanobis distance yields robust distances.
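
Continuing the sketch, the robust distances follow by plugging the MCD estimates into mahalanobis() (the name rd is ours):

    rd <- sqrt(mahalanobis(X, center = mcd$center, cov = mcd$cov))
    which(rd > cutoff)   # flag observations by their robust distance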

11. Animals: robust tolerance ellipsoid

We construct the robust tolerance ellipsoid in the same way, now based on the MCD estimates.
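
Reusing the hypothetical tol_ellipse() helper from above with the MCD estimates:

    ell_rob <- tol_ellipse(mcd$center, mcd$cov)
    plot(X, xlab = "log(body weight)", ylab = "log(brain weight)")
    lines(ell_rob)

robustbase also provides tolEllipsePlot(), which can draw the robust and classical ellipses together (with classic = TRUE).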

12. Animals: robust tolerance ellipsoid

The robust ellipsoid clearly identifies the three outliers (observations 6, 16 and 26), and we even see two other species, 14 and 17, close to the boundary.

13. Distance-distance plot

In higher dimensions (with more than three variables) it becomes infeasible to visualize the tolerance ellipsoid. The distance-distance plot is then a popular alternative: the classical Mahalanobis distances are plotted on the x-axis, and the robust distances on the y-axis. The straight lines represent the cut-off values, again derived from the chi-square distribution. We immediately see which outliers are detected by the classical and robust Mahalanobis distances.
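
A sketch of such a plot from the distances computed above:

    # classical distances on the x-axis, robust distances on the y-axis
    plot(md, rd, xlab = "Mahalanobis distance", ylab = "Robust distance")
    abline(h = cutoff, v = cutoff, lty = 2)   # chi-square cut-off lines
    abline(0, 1)                              # where both distances agree

robustbase can also produce a distance-distance plot directly via plot(mcd, which = "dd").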

14. Animals: check outliers

In this example, the flagged animals can indeed be seen as "fraudsters" among the other species: the only three dinosaurs in the dataset are flagged as clear outliers, and the human and rhesus monkey as boundary cases. They indeed have a suspicious brain weight relative to their body weight compared to most animals.

15. Let's practice!

In the last exercise, you will need to flag potential insurance fraud cases.