Testing the extremes with Grubbs' test

1. Testing the extremes with Grubbs' test

In this lesson, you'll learn a statistical procedure called Grubbs' test, which can help in assessing whether the data contain outliers.

2. Visual assessment is not always reliable!

A visual assessment for outliers, such as a boxplot, can work really well when the majority of the data are close together and a few outliers are clearly separated. This boxplot shows the temperature data from the previous video. The point circled in red lies close enough to the majority of the data that we can't be certain it's an outlier. When this happens, we can use a statistical test called Grubbs' test to assess the point more formally.

3. Grubbs' test

Grubbs' test assesses whether the point that lies farthest from the mean in a dataset could be an outlier. This point will be either the largest or the smallest value in the data. Grubbs' test assumes that the data are normally distributed, so it is important to check that this assumption is plausible for your data before using the test. A histogram provides a common way to check the normality assumption visually.
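
To make the idea concrete, here is a minimal sketch in base R of the quantity the test is built on: the distance of the farthest point from the mean, measured in standard deviations. The temps vector holds made-up values for illustration, not the course's temperature data, and the variable names are hypothetical.

```r
# Hypothetical temperatures in degrees Celsius (invented for illustration)
temps <- c(18.5, 19.1, 20.3, 17.8, 30.0, 19.6,
           18.9, 20.8, 19.4, 18.2, 20.1, 19.0)

# Absolute distance of every point from the sample mean
distance <- abs(temps - mean(temps))

# Position and value of the point farthest from the mean
farthest_index <- which.max(distance)
farthest_value <- temps[farthest_index]

# Grubbs' statistic: the largest distance in units of the sample standard deviation
G <- max(distance) / sd(temps)

cat("Farthest point:", farthest_value, "at position", farthest_index, "\n")
cat("Grubbs' statistic G:", round(G, 2), "\n")
```

The larger G is, the further the extreme point sits from the rest of the data relative to the overall spread.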

4. Checking normality with a histogram

In R, a histogram is produced using the hist function. The main argument of the hist function is the data to show, while the breaks argument controls roughly how many bins the histogram has. For larger datasets, breaks can be increased to get a more detailed view of the distribution. When checking for normality, we should look at both the symmetry and the shape of the histogram. Normally distributed data should have an approximately symmetrical, bell-shaped histogram, which is roughly true for the temperature data. If the distribution seems lopsided or has more than one peak, then Grubbs' test should not be used. If you come across data like this, that's OK: you'll learn other techniques later in the course that can help in this case!
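
As a rough sketch of this check, the call below draws a histogram for a made-up temps vector (invented values, not the course data). The breaks value of 6 and the title and axis label are arbitrary choices for illustration.

```r
# Hypothetical temperatures in degrees Celsius (invented for illustration)
temps <- c(18.5, 19.1, 20.3, 17.8, 30.0, 19.6,
           18.9, 20.8, 19.4, 18.2, 20.1, 19.0)

# Draw the histogram; a single number for breaks is treated as a suggestion,
# so R may pick a slightly different number of bins
h <- hist(temps,
          breaks = 6,
          main = "Histogram of temperature",
          xlab = "Temperature (Celsius)")

# The returned object stores the bin edges and the count in each bin
print(h$breaks)
print(h$counts)
```

With only a handful of observations the shape will look rough, so treat the histogram as a plausibility check rather than proof of normality.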

5. Running Grubbs' test

In R, Grubbs' test is performed using the grubbs.test function, and the result of applying it to the temperature data is shown. There are two key pieces of information to look for in the text returned by the function: the data point that was tested, and the p-value associated with the test. The point that was tested will be either the maximum or the minimum value in the data, whichever is farthest from the mean. The final line of output shows that the value tested was the maximum temperature of 30 degrees Celsius.
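
A minimal sketch of such a call is shown here. The grubbs.test function comes from the outliers package, which has to be installed separately, so the code only runs the test if the package is available. The temps vector is made up for illustration, so the numbers it prints will differ from the output on the slide.

```r
# Hypothetical temperatures in degrees Celsius (invented for illustration)
temps <- c(18.5, 19.1, 20.3, 17.8, 30.0, 19.6,
           18.9, 20.8, 19.4, 18.2, 20.1, 19.0)

if (requireNamespace("outliers", quietly = TRUE)) {
  # Test the value farthest from the mean
  result <- outliers::grubbs.test(temps)

  # Full printed summary of the test
  print(result)

  # The two pieces of information to look for
  cat("Point tested:", result$alternative, "\n")
  cat("p-value:", result$p.value, "\n")
} else {
  message("Install the 'outliers' package to run this example.")
}
```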

6. Interpreting the p-value

The p-value is a number between 0 and 1 that measures how much evidence there is that the tested point is an outlier. Values near 0 indicate stronger evidence that the tested point is an outlier, while values near 1 indicate weaker evidence. Any point whose p-value is below 0.05 should be treated with suspicion. In the output shown, the reported p-value is 0.00179, which indicates that the 30-degree Celsius spring day should be considered an outlier compared to the other temperatures.
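
To give a feel for where such a p-value can come from, here is a sketch in base R of one common approach: convert Grubbs' statistic to a t statistic and scale its upper-tail probability by the sample size. The lesson does not describe the computation, so treat this method as an assumption. The temps values and the alpha name are also made up for illustration.

```r
# Hypothetical temperatures in degrees Celsius (invented for illustration)
temps <- c(18.5, 19.1, 20.3, 17.8, 30.0, 19.6,
           18.9, 20.8, 19.4, 18.2, 20.1, 19.0)

n <- length(temps)

# Grubbs' statistic: largest distance from the mean in standard deviations
G <- max(abs(temps - mean(temps))) / sd(temps)

# Convert G to a t statistic with n - 2 degrees of freedom
t_stat <- sqrt(n * (n - 2) * G^2 / ((n - 1)^2 - n * G^2))

# One-sided p-value: upper-tail probability scaled by n, capped at 1
p_value <- min(1, n * pt(t_stat, df = n - 2, lower.tail = FALSE))

# Compare against the 0.05 threshold used in this lesson
alpha <- 0.05
if (p_value < alpha) {
  cat("p-value", signif(p_value, 3), "is below", alpha,
      ": treat the point as a suspected outlier\n")
} else {
  cat("p-value", signif(p_value, 3), "is not below", alpha,
      ": little evidence that the point is an outlier\n")
}
```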

7. Get the row index of an outlier

Finally, to find the integer row position of the maximum or minimum value that was tested, we can use the which.max or which.min function. The Grubbs' test you just saw tested the maximum value of 30, so the which.max function should be used. The maximum value occurs in row 5 of the temperature data. Correspondingly, to find the position of the minimum value, the which.min function should be used, as shown here.
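
A short sketch of both lookups follows. The weather data frame and its day and temp columns are hypothetical names with made-up values, not the course's dataset.

```r
# Hypothetical data frame (names and values invented for illustration)
weather <- data.frame(
  day  = 1:12,
  temp = c(18.5, 19.1, 20.3, 17.8, 30.0, 19.6,
           18.9, 20.8, 19.4, 18.2, 20.1, 19.0)
)

# Row position of the largest and smallest temperature
max_row <- which.max(weather$temp)
min_row <- which.min(weather$temp)

# Inspect the full rows at those positions
print(weather[max_row, ])
print(weather[min_row, ])
```

Both functions return the position of the first match if the extreme value occurs more than once.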

8. Let's practice!

Now it's your turn.
