Outliers, leverage, and influence

1. Outliers, leverage, and influence

Sometimes, datasets contain weird values. Here, we'll look at how to spot those values, and the consequences they have for your regression models.

2. Roach dataset

Let's look at another part of the fish dataset, this time filtering for the roaches.

3. Which points are outliers?

Here's the standard plot of mass versus length. The technical term for an unusual data point is an outlier. So which of these points constitutes an outlier?

4. Extreme explanatory values

The first kind of outlier is when you have explanatory variables that are extreme. In the simple linear regression case, it's easy to find and visualize them. There is one really short roach and one really long roach that I've colored cyan here.

5. Response values away from the regression line

The other kind of outlier is a point that lies a long way from the model predictions. Here, there's a roach with mass zero, which seems biologically unlikely. It's shown as a triangle.

6. Leverage

Leverage quantifies how extreme your explanatory variable values are. That is, it measures the first type of outlier we discussed. With one explanatory variable, you can find the values by filtering, but with many explanatory variables, the mathematics is more complicated. To calculate leverage, you need a model object. For historical reasons, the leverage function is called hatvalues. Like the fitted values and residuals functions, it returns a numeric vector with as many values as there are observations.
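As a minimal sketch of the hatvalues call, assume a small made-up data frame standing in for the roach data (the values and the column names mass_g and length_cm are illustrative, not the course dataset). With a single explanatory variable, leverage can also be written by hand as 1/n plus the squared distance from the mean length divided by the total squared distance, so we can check the two agree.

```r
# Made-up stand-in for the roach data (illustrative values only)
roach <- data.frame(
  length_cm = c(12.9, 16.5, 17.5, 18.2, 19.0, 20.5, 21.0, 22.0, 23.6, 29.5),
  mass_g    = c(40, 69, 78, 87, 0, 145, 140, 169, 180, 390)
)

# Leverage needs a model object
mdl_roach <- lm(mass_g ~ length_cm, data = roach)

# One leverage value per observation
leverage <- hatvalues(mdl_roach)

# Hand calculation, valid for one explanatory variable plus an intercept
x <- roach$length_cm
manual <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

# The two calculations should match
all.equal(unname(leverage), manual)

# Leverage values add up to the number of model coefficients (here, 2)
sum(leverage)
```

The hand formula makes the idea concrete: the further a length is from the average length, the higher the leverage.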

7. The .hat column

augment, from the broom package, will also calculate leverage. The values are stored in the dot-hat column.

8. Highly leveraged roaches

Let's find the values with high leverage. After augmenting, we select the columns of interest: the mass, the length, and dot-hat, renamed here as "leverage". Then we arrange the rows by descending leverage values and get the head. The top two are the same observations we identified earlier. The really long roach and the really short roach.
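The pipeline just described might look something like this. It's a sketch that assumes the broom and dplyr packages are installed, and it uses a made-up stand-in for the roach data (the values and the column names mass_g and length_cm are assumptions).

```r
library(broom)
library(dplyr)

# Made-up stand-in for the roach data (illustrative values only)
roach <- data.frame(
  length_cm = c(12.9, 16.5, 17.5, 18.2, 19.0, 20.5, 21.0, 22.0, 23.6, 29.5),
  mass_g    = c(40, 69, 78, 87, 0, 145, 140, 169, 180, 390)
)

mdl_roach <- lm(mass_g ~ length_cm, data = roach)

top_leverage <- mdl_roach %>%
  augment() %>%                                        # adds .hat, .cooksd, and friends
  select(mass_g, length_cm, leverage = .hat) %>%       # keep and rename columns of interest
  arrange(desc(leverage)) %>%                          # highest leverage first
  head()

top_leverage
```

In this made-up data, the first two rows are the longest and the shortest fish, because they sit furthest from the average length.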

9. Influence

A related concept to leverage is influence. This is a type of "leave one out" metric. That is, it measures how much the model would change if you reran it without that data point. I like to think of it as the torque of the point. The amount of turning force, or torque, when using a wrench is equal to the linear force times the length of the wrench. In a similar way, the influence of each observation is based on the size of the residuals and the leverage. It isn't a straightforward multiplication; instead we use a metric called Cook's distance.

10. Cook's distance

The calculations for Cook's distance require some linear algebra, but the important thing to know is that it is based on the size of the residuals and the leverage, and that a bigger number denotes more influence for the observation. The cooks-dot-distance function returns the values as a vector.
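To make "based on the size of the residuals and the leverage" concrete, here is a sketch that compares cooks.distance with one common way of writing the formula by hand: the squared residual, scaled by the number of coefficients times the residual variance, multiplied by leverage over one minus leverage, squared. The data frame is a made-up stand-in for the roach data, and the column names are assumptions.

```r
# Made-up stand-in for the roach data (illustrative values only)
roach <- data.frame(
  length_cm = c(12.9, 16.5, 17.5, 18.2, 19.0, 20.5, 21.0, 22.0, 23.6, 29.5),
  mass_g    = c(40, 69, 78, 87, 0, 145, 140, 169, 180, 390)
)

mdl_roach <- lm(mass_g ~ length_cm, data = roach)

# Built-in calculation: one value per observation
cooks <- cooks.distance(mdl_roach)

# Ingredients for the hand calculation
e  <- residuals(mdl_roach)                 # residuals
h  <- hatvalues(mdl_roach)                 # leverage
p  <- length(coef(mdl_roach))              # number of coefficients
s2 <- sum(e^2) / df.residual(mdl_roach)    # residual variance estimate

# Cook's distance written out by hand
manual <- e^2 / (p * s2) * h / (1 - h)^2

# The two calculations should match
all.equal(cooks, manual)
```

Either a big residual or a leverage close to one pushes the value up, which is why the torque analogy works.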

11. The .cooksd column

Here's the output from augment again. Cook's Distance is contained in the dot-cooksd column.

12. Most influential roaches

Using the same approach, we can find the most influential roaches. We augment, then select the columns of interest, then arrange to get the top values. Here, you can see the two points that were highly leveraged, and the fish with zero mass, which has a large residual.

13. Removing the most influential roach

To see how influence works, let's remove the most influential roach. This is the one with the shortest length, at twelve-point-nine centimeters. We draw the usual plot, but add another regression line using the dataset without that short fish. The slope of the line has completely changed just by having one fewer data point.
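Here is a sketch of the same "leave one out" comparison without the plot. It uses a made-up stand-in for the roach data (values and column names are assumptions), so the most influential row here need not be the same fish as in the course data. We pick it with which.max rather than hard-coding a length.

```r
# Made-up stand-in for the roach data (illustrative values only)
roach <- data.frame(
  length_cm = c(12.9, 16.5, 17.5, 18.2, 19.0, 20.5, 21.0, 22.0, 23.6, 29.5),
  mass_g    = c(40, 69, 78, 87, 0, 145, 140, 169, 180, 390)
)

# Model on all rows
mdl_roach <- lm(mass_g ~ length_cm, data = roach)

# Row with the largest Cook's distance
most_influential <- which.max(cooks.distance(mdl_roach))

# Drop that row and refit
roach_trimmed <- roach[-most_influential, ]
mdl_trimmed <- lm(mass_g ~ length_cm, data = roach_trimmed)

# Compare the slopes with and without the point
slopes <- c(
  all_rows      = coef(mdl_roach)[["length_cm"]],
  without_point = coef(mdl_trimmed)[["length_cm"]]
)
slopes
```

The gap between the two slopes is what Cook's distance is summarizing for that observation.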

14. autoplot()

autoplot also lets you draw diagnostic plots of leverage and influence, by setting the which argument to 4, 5, or 6. I find these plots less helpful for diagnosis than the previous three we looked at, other than seeing the labels of the most influential observations.
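A sketch of that call, assuming the ggplot2 and ggfortify packages are installed (ggfortify supplies the autoplot method for linear models). The data frame is a made-up stand-in for the roach data, and the one-column layout is just one choice.

```r
library(ggplot2)
library(ggfortify)

# Made-up stand-in for the roach data (illustrative values only)
roach <- data.frame(
  length_cm = c(12.9, 16.5, 17.5, 18.2, 19.0, 20.5, 21.0, 22.0, 23.6, 29.5),
  mass_g    = c(40, 69, 78, 87, 0, 145, 140, 169, 180, 390)
)

mdl_roach <- lm(mass_g ~ length_cm, data = roach)

# which = 4: Cook's distance per observation
# which = 5: residuals versus leverage
# which = 6: Cook's distance versus leverage
diagnostics <- autoplot(mdl_roach, which = 4:6, nrow = 3, ncol = 1)
diagnostics
```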

15. Let's practice!

Let's get under the influence.