1. Removing outliers
Even after performing these transformations, you will often find that your data is still very skewed. This is frequently caused by outliers in your data.
2. What are outliers?
Outliers are data points that lie far away from the majority of your data. They can arise for several reasons, ranging from incorrect data recording to genuinely rare occurrences. Either way, you will often want to remove these values as they can negatively impact your models. An example of the negative effect can be seen here, where a single outlier causes almost all of the scaled data to be squashed toward the lower bound.
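The squashing effect can be reproduced with a few lines of pandas. This is a minimal sketch: the numbers are made up for illustration, and min-max scaling is assumed as the scaling method, since the narration does not say which scaler was used.

```python
import pandas as pd

# Hypothetical data: nine ordinary values and one extreme outlier.
values = pd.Series([10, 12, 11, 13, 12, 14, 11, 13, 12, 1000], name="value")

# Min-max scaling maps the smallest value to 0 and the largest to 1.
scaled = (values - values.min()) / (values.max() - values.min())

# Largest scaled value among the nine ordinary points.
largest_non_outlier = scaled.iloc[:-1].max()

print(scaled.round(4).tolist())
print(largest_non_outlier)
```

The outlier takes the value 1, while every other point ends up within a tiny fraction of 0, so the differences between ordinary values are almost lost.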
3. Quantile-based detection
The first approach we will discuss is to remove a certain percentage of the largest and/or smallest values in your data. For example, you could remove the top 5%. This is achieved by finding the 95th percentile, that is the 0.95 quantile (the point below which 95% of your data resides), and removing everything above it.
This approach is particularly useful if you are concerned that the highest values in your dataset should be avoided. When using this approach, you must remember that even if there are no real outliers, you will still be removing the top 5% of values from the dataset.
4. Quantiles in Python
To find the 95th percentile, you can call the quantile() method on the column with 0.95 as the argument. You can then create a mask to find which values lie below this cutoff and subset the data accordingly.
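The steps just described might look like the following sketch. The DataFrame `df`, the column name `value`, and the variable names are assumptions made for illustration, not taken from the course slides.

```python
import pandas as pd

# Hypothetical DataFrame: the integers 1 to 100, with no real outliers.
df = pd.DataFrame({"value": list(range(1, 101))})

# The 0.95 quantile: the point below which 95% of the data lies.
q_cutoff = df["value"].quantile(0.95)

# Boolean mask that is True for rows below the cutoff.
mask = df["value"] < q_cutoff

# Keep only the rows where the mask is True.
trimmed_df = df[mask]

print(q_cutoff)
print(len(df), len(trimmed_df))
```

Note that this evenly spread data contains nothing unusual, yet the top 5% of rows are still dropped, which is the caveat mentioned earlier.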
5. Standard deviation-based detection
An alternative, and perhaps more statistically sound, method of removing outliers is to decide what counts as an outlier based on the mean and standard deviation of the dataset.
For example, you may want to eliminate all data more than 3 standard deviations from the mean, as you expect those data points to be outliers. This approach has the benefit of only removing genuinely extreme values: if only one value is an outlier, only that value is affected.
6. Standard deviation detection in Python
To apply this in Python, you first need to find the mean and standard deviation of your column by calling the mean() and std() methods on the column, respectively.
You then find the upper bound by adding 3 times the standard deviation to the mean, and similarly find the lower bound by subtracting 3 times the standard deviation from the mean.
Once you have found these bounds, you can apply them as a mask to the DataFrame, as shown here.
This method ensures that only data that is genuinely different from the rest is removed, and will remove fewer points if the data is close together.
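A minimal sketch of this procedure follows. As before, the DataFrame `df`, the column name `value`, and the sample numbers are hypothetical and chosen only to illustrate the steps.

```python
import pandas as pd

# Hypothetical DataFrame: 30 ordinary values plus one extreme value.
df = pd.DataFrame({"value": [10, 11, 12] * 10 + [500]})

# Mean and (sample) standard deviation of the column.
mean = df["value"].mean()
std = df["value"].std()

# Bounds at 3 standard deviations either side of the mean.
cut_off = 3 * std
lower, upper = mean - cut_off, mean + cut_off

# Keep only the rows that fall strictly inside the bounds.
trimmed_df = df[(df["value"] > lower) & (df["value"] < upper)]

print(lower, upper)
print(len(df), len(trimmed_df))
```

Here only the single extreme value falls outside the bounds, so it is the only row removed. One caveat worth keeping in mind is that the outlier itself inflates both the mean and the standard deviation, so in very small datasets an extreme value may not exceed the 3 standard deviation threshold at all.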
7. Let's practice!
Now it's time for you to put what you have learned about outliers into practice.