Handling outliers

1. Handling outliers

In this video, you will learn how to deal with outliers, values that are so extreme that they can drastically influence your model.

2. Influence of outliers on predictive models

Consider a predictive modeling problem in which you want to predict whether someone will donate to a certain campaign, using the predictor variable `age`. The age is entered manually by the donor in an online form, and one of the donors accidentally entered 500. If you look at the logistic regression model, you can see that a single outlier can have a huge influence on your model, as illustrated by the dotted lines. Therefore, it is best to handle outliers before diving into the modeling process.

3. Causes of outliers

Such outliers can enter your data through human error, but also through measurement errors in machines. In some cases, outliers in the data are correct but should still be replaced, because the extreme values are not representative of the data. Handling outliers is a large area of research in data science. In the following, we will discuss two pragmatic approaches to handling outliers, namely winsorization and an approach based on the standard deviation.

4. Winsorization concept

Winsorization is a very simple technique in which extreme values are capped at lower and upper limits that are calculated from percentiles. For instance, consider the variable `mean_donations`, which holds the mean donation someone gave. Assume that 5% of all mean donations are below 6 euros, and 5% of all mean donations are above 950 euros. Winsorization then replaces all values lower than 6 euros with 6 euros, and all values higher than 950 euros with 950 euros.
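To make the percentile idea concrete, here is a minimal sketch using NumPy. The donation amounts are made up for illustration and do not come from the course data, and the use of `np.percentile` with `np.clip` is an assumption about one reasonable way to compute and apply the limits by hand.

```python
import numpy as np

# Hypothetical mean donations in euros, with one very small and one very large value.
mean_donations = np.array(
    [2, 10, 15, 20, 25, 30, 40, 50, 60, 75,
     80, 90, 100, 120, 150, 200, 250, 300, 400, 5000],
    dtype=float,
)

# Lower and upper limits taken from the 5th and 95th percentiles.
lower_limit, upper_limit = np.percentile(mean_donations, [5, 95])

# Values below the lower limit become the lower limit,
# values above the upper limit become the upper limit.
winsorized = np.clip(mean_donations, lower_limit, upper_limit)

print(lower_limit, upper_limit)
print(winsorized)
```

Only the two extreme values change; everything between the limits is left as it was.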

5. Winsorization in Python

In Python, this can easily be done using the `scipy.stats.mstats` module. The first argument of the `winsorize` function takes the original pandas dataframe column; the second argument, `limits`, indicates which fraction of values is capped at the lower and at the upper end. For instance, if `limits` is set to 0.05 and 0.01, the lowest 5% of values are replaced by the 5th percentile value, and the highest 1% of values are replaced by the 99th percentile value.
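As a rough illustration of the call described above, the sketch below applies `winsorize` from `scipy.stats.mstats` to a made-up array of 100 values. The data and variable names are hypothetical; a pandas column could be passed in the same way.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical data: the values 1..99 plus one extreme value.
values = np.arange(1, 101, dtype=float)
values[-1] = 5000.0

# Cap the lowest 5% and the highest 1% of the values.
winsorized = winsorize(values, limits=[0.05, 0.01])

# winsorize returns a masked array; convert it if a plain array is needed.
winsorized = np.asarray(winsorized)

print(winsorized.min(), winsorized.max())
```

Note that the second element of `limits` is the fraction to cap at the upper end (0.01), not the percentile itself (0.99). By default the input array is left unchanged and a new array is returned.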

6. Standard deviation method concept

Another way to handle outliers is to use a simple rule of thumb. Instead of defining the lower and upper limits using percentiles, you can define them using the standard deviation and mean of the original variable. An often-used value for the lower limit is the mean minus 3 times the standard deviation, and for the upper limit the mean plus 3 times the standard deviation. Values that are lower than the lower limit are replaced by the lower limit, and values higher than the upper limit are replaced by the upper limit.

7. Standard deviation method in Python

This can easily be done in Python. First, you calculate the mean and standard deviation of the variable, and use these values to define the lower and upper limits. Next, you replace values that are higher than the upper limit or lower than the lower limit using the minimum and maximum operations. Indeed, by taking the maximum of a value and the lower limit, the lower limit is applied, and by taking the minimum of that result and the upper limit, the upper limit is applied.
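The steps just described could look roughly like the sketch below. The ages are made up, with a single mistyped value of 500 as in the earlier example, and the use of NumPy's `np.maximum` and `np.minimum` is an assumption about how the minimum and maximum operations might be written.

```python
import numpy as np

# Hypothetical ages: 29 plausible values plus one mistyped entry of 500.
age = np.append(np.arange(20, 78, 2), 500).astype(float)

# Mean and standard deviation of the original variable.
# Note: np.std uses the population formula (ddof=0) by default,
# whereas pandas' Series.std uses the sample formula (ddof=1).
mean_age = age.mean()
sd_age = age.std()

# Limits at three standard deviations from the mean.
lower_limit = mean_age - 3 * sd_age
upper_limit = mean_age + 3 * sd_age

# The maximum with the lower limit applies the lower limit;
# the minimum of that result with the upper limit applies the upper limit.
age_capped = np.minimum(np.maximum(age, lower_limit), upper_limit)

print(lower_limit, upper_limit)
print(age_capped[-1])
```

Because the outlier itself inflates the mean and the standard deviation, the upper limit here is still far above any realistic age. The value 500 is pulled down to that limit, not to a plausible age.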

8. Let's practice!

Time to put this into practice.
