Effect of an outlier

1. Effect of an outlier

Just as violating technical conditions can impact the accuracy of a p-value, one or a handful of outlying values can also have an unintended impact on inference for regression.

2. Placeholder

Recall the linear model regressing protein on fiber. You may have noticed previously that there was one food item with quite a bit of fiber and relatively little protein. Suppose that you'd like to model only foods with relatively low fiber. That is, if we remove the high-fiber food, we can create a linear model that describes the relationship between fiber and protein only for foods that have less than 15 g of fiber.

3. Different regression lines

The decision of whether or not to keep the high-fiber food will have an impact on the regression line describing the relationship between protein and fiber. Notice that by removing the outlier, the regression line changes from the original least squares estimate.

4. Different regression lines

The two regression lines have been superimposed so that the change in relationship is easier to see. Keep in mind that the red line models foods with less than 15 g of fiber because we have subset the explanatory variable. We would never remove a data point simply because it didn't fit a particular model; we only remove the point if we are interested in describing only a subset of the observations.

5. Different regression models

In this code, we use the original dataset as well as the dataset that is filtered to include only values for which Fiber is less than 15. The linear model is fit to both datasets. As expected, the values of the slope and intercept change depending on which data values are included in their calculations. And although the p-value for the slope is statistically significant in both situations, the two p-values differ by a factor of 5.
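The filter-then-fit comparison can be sketched in plain Python. This is only an illustration under stated assumptions: the data values are made up (they are not the course dataset and will not reproduce the numbers on the slide), the helper name `slope_inference` is hypothetical, and the toy outlier is chosen to exaggerate the effect. The sketch reports the intercept, the slope, and the t statistic for the slope (slope divided by its standard error); a p-value would then come from a t distribution with n - 2 degrees of freedom.

```python
from math import sqrt
from statistics import mean

def slope_inference(x, y):
    """Return (intercept, slope, t statistic) for the line y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares, then the standard error of the slope.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = sqrt(sse / (n - 2) / sxx)
    return intercept, slope, slope / se_slope

# Made-up (fiber, protein) values in grams, NOT the course data.
# The last item plays the role of the high-fiber outlier.
fiber = [0, 1, 2, 3, 4, 5, 6, 30]
protein = [12, 15, 13, 17, 16, 20, 19, 8]

# Same filter as in the text: keep only foods with less than 15 g of fiber.
low = [(f, p) for f, p in zip(fiber, protein) if f < 15]
low_fiber = [f for f, _ in low]
low_protein = [p for _, p in low]

full_fit = slope_inference(fiber, protein)
low_fit = slope_inference(low_fiber, low_protein)
print("full data:  intercept=%.2f slope=%.2f t=%.2f" % full_fit)
print("fiber < 15: intercept=%.2f slope=%.2f t=%.2f" % low_fit)
```

Fitting the same helper to both versions of the data keeps the comparison honest: the only thing that changes between the two calls is which observations are included.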

6. Different regression randomization tests

As with the mathematical analysis, the randomization test on the slope can be run for the full dataset or the low-fiber dataset. Again, both tests give results that are well within the statistically significant range.
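A randomization test on the slope can be sketched the same way. The version below is a minimal permutation test, not the course's code: it shuffles the response values to break any link with the explanatory variable, recomputes the slope each time, and reports the share of shuffled slopes at least as extreme as the observed one. The data are made up, and the function names, the number of repetitions, and the seed are all assumptions chosen for illustration.

```python
import random
from statistics import mean

def slope(x, y):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

def randomization_p_value(x, y, reps=2000, seed=1):
    """Two-sided permutation p-value for the slope of y on x."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(slope(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(reps):
        # Shuffling y simulates the null hypothesis of no relationship.
        rng.shuffle(shuffled)
        if abs(slope(x, shuffled)) >= observed:
            extreme += 1
    return extreme / reps

# Made-up (fiber, protein) values in grams, NOT the course data.
fiber = [0, 1, 2, 3, 4, 5, 6, 30]
protein = [12, 15, 13, 17, 16, 20, 19, 8]

# Low-fiber subset: foods with less than 15 g of fiber.
low_fiber = [f for f in fiber if f < 15]
low_protein = [p for f, p in zip(fiber, protein) if f < 15]

p_full = randomization_p_value(fiber, protein)
p_low = randomization_p_value(low_fiber, low_protein)
print("full data p-value: ", p_full)
print("fiber < 15 p-value:", p_low)
```

Because the p-value is estimated from random shuffles, it changes slightly with the seed and the number of repetitions; more repetitions give a more stable estimate.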

7. Let's practice!

Thanks for following along with this video. Now it is your turn to practice. But remember: never remove points from the dataset unless you know the observed value to be incorrect or you are modeling only a subset of your explanatory variable.