
Outliers

1. Outliers

We've discussed three different aspects of a distribution that are important to note when conducting an exploratory data analysis: center, variability, and shape.

2. Characteristics of a distribution

A fourth and final thing to look for is outliers. These are observations that have extreme values far from the bulk of the distribution. They're often very interesting cases, but they're also good to know about before proceeding with more formal analysis.

3. Box plot of county income

We saw some extreme values when we plotted the distribution of income for counties on the West Coast. What are we to make of this blip of counties? One thing we can do is redraw this as a box plot. Here I've added an additional layer that flips the coordinates so that the boxes are stretched out horizontally, which makes the comparison with the density plot easier. What we see is interesting: the box plot flags many counties as outliers, both along the West Coast and in the rest of the country.

So why was the blip more apparent on the West Coast? It has to do with sample size. There are far fewer counties in the West Coast group, so these few outliers had an outsized effect on the density plot. In the non-West Coast group, there are many more counties, and they wash out the effect of these outliers in the density plot.
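The flipped box plot described here can be sketched roughly as follows. This is a minimal illustration, not the code from the video: the data frame name counties, its columns west_coast and income, and every value in it are assumptions made up for the example. The last step uses base R's boxplot.stats to list the points that a 1.5 x IQR rule would flag in each group.

```r
library(ggplot2)

# Toy stand-in for the county data; all values are invented for illustration.
counties <- data.frame(
  west_coast = c(rep(TRUE, 6), rep(FALSE, 12)),
  income = c(41000, 44000, 47000, 50000, 53000, 98000,
             36000, 38000, 40000, 41000, 42000, 43000,
             44000, 45000, 46000, 48000, 50000, 95000)
)

# Box plot of income per group, with coordinates flipped so the boxes
# run horizontally along the income axis, like a density plot.
p_box <- ggplot(counties, aes(x = west_coast, y = income)) +
  geom_boxplot() +
  coord_flip()

# Values beyond 1.5 * IQR from the hinges, per group.
# Note: ggplot2 computes its hinges slightly differently from
# boxplot.stats, so borderline points may not match exactly.
flagged <- lapply(split(counties$income, counties$west_coast),
                  function(x) boxplot.stats(x)$out)
print(flagged)
```

In this toy data each group contains exactly one flagged value, but that single value is one of six observations in the smaller group and one of twelve in the larger one, which is the sample size effect described above in miniature.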

4. Indicating outliers

It is often useful to consider outliers separately from the rest of the data, so let's create a new column to store whether or not a given case is an outlier. This requires that we mutate a new column called is_outlier that is TRUE if the income is greater than some threshold and FALSE otherwise. In this case, we've picked a threshold of $75,000, so any county with an income above that counts as an outlier. We can inspect the outliers by filtering the dataset to only include outliers, then arranging them in decreasing order of income. Because we didn't save this dplyr chain back to an object, it just prints the sorted outliers to the console. We learn that the top income county is actually Teton County, in Wyoming, and that three of the top ten counties are in Texas and two are in Nebraska. We can also try rebuilding the density plots without the outliers.
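A minimal sketch of the steps just described, assuming a data frame called counties with county, state, and income columns. Those names and all of the values are hypothetical placeholders, not the data shown in the video; only the $75,000 threshold comes from the narration.

```r
library(dplyr)

# Toy data; county names, states, and incomes are invented placeholders.
counties <- data.frame(
  county = c("County A", "County B", "County C", "County D", "County E"),
  state  = c("State X", "State X", "State Y", "State Y", "State Z"),
  income = c(42000, 81000, 55000, 97000, 76000)
)

# Add a logical column: TRUE when income exceeds the chosen threshold.
counties <- counties %>%
  mutate(is_outlier = income > 75000)

# Keep only the outliers and sort from highest to lowest income.
# The result is not assigned, so it is simply printed.
counties %>%
  filter(is_outlier) %>%
  arrange(desc(income))
```

The threshold here is a judgment call rather than a statistical rule, so it is worth trying a few values and seeing how the list of flagged cases changes.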

5. Plotting without outliers

To do this, we form a dplyr chain where the first step is to filter on those counties that are not outliers. Recall that is_outlier is a vector of TRUEs and FALSEs. Those values can be reversed using an exclamation point, forming a variable that indicates the counties that are not outliers. We then pipe this into the same code we used for the overlaid density plots. The result is a plot that focuses much more on the body of the distribution. You can contrast that with the original plot, which was dominated by the strong right skew caused by the extreme values. Note that neither of these plots is right or wrong, but they tell different stories about the structure in this data, both of which are valuable.
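As a rough sketch of that chain, again on invented data: the counties data frame, its west_coast and income columns, and the alpha value are assumptions for illustration, and the density plot call is one plausible way to draw overlaid densities rather than the exact code from the video.

```r
library(dplyr)
library(ggplot2)

# Toy data with one extreme value per group; all values are invented.
counties <- data.frame(
  west_coast = c(rep(TRUE, 6), rep(FALSE, 12)),
  income = c(41000, 44000, 47000, 50000, 53000, 98000,
             36000, 38000, 40000, 41000, 42000, 43000,
             44000, 45000, 46000, 48000, 50000, 95000)
) %>%
  mutate(is_outlier = income > 75000)

# !is_outlier flips TRUE and FALSE, so filter() keeps the non-outliers,
# which are then piped straight into the overlaid density plot.
p_trimmed <- counties %>%
  filter(!is_outlier) %>%
  ggplot(aes(x = income, fill = west_coast)) +
  geom_density(alpha = 0.3)
```

Printing p_trimmed draws the plot. Dropping the filter() step from the chain gives the version with outliers included, which makes it easy to compare the two views side by side.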

6. Let's practice!

OK, now it's your turn to practice exploring outliers in the Gapminder data.