How to deal with identified outliers

1. How to deal with found outliers

We've talked a lot about identifying outliers but haven't discussed what to do when we find them. We've considered removing them, but this isn't always what we want in practice.

2. Applications of anomaly detection

In certain contexts, anomaly detection is used solely to detect outliers and learn how and why they happen. Examples include medicine (detecting malignant tumors), cyber security, and fraud detection. In these scenarios, finding and analyzing the outliers delivers valuable insights into what constitutes anomalous behavior and how to deal with it. Practitioners usually perform the same analyses with and without outliers and highlight the differences in the results.
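As a rough illustration of running the same analysis with and without outliers, here is a minimal sketch. The data is made up, and the 1.5 * IQR rule used to flag outliers is just one common convention assumed for this example; any detector could be used in its place.

```python
import pandas as pd

# Hypothetical measurements; the value 250 is an obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 250])

# Flag outliers with the 1.5 * IQR rule (an assumption for this sketch).
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Run the same summary statistics with and without the flagged points.
summary = pd.DataFrame({
    "with_outliers": values.agg(["mean", "median", "std"]),
    "without_outliers": values[~is_outlier].agg(["mean", "median", "std"]),
})
print(summary)
```

Placing the two columns side by side makes it easy to see how much a single extreme value moves the mean compared with the median.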

3. The reasons for outlier presence

Before dropping an outlier, we should find out its type and why it occurred. Usually, outliers happen for three reasons. The first reason is data entry errors. These include typos, measurement errors, human mistakes, or bad signals. If we can't correct the error, for example by fixing the typo or re-recording that particular datapoint, we can safely drop the value because we know it is incorrect. The second reason is sampling problems. Studies use samples to draw conclusions about a specific population. Sometimes the data collector accidentally obtains items that aren't from the target population. An example is having female respondents in a survey designed for men with addiction problems. If a sample isn't within the normal characteristics of the chosen population, it is safe to exclude it as well. The last reason is natural variability. Weird and odd things happen all the time in nature and the real world. If we take a large enough sample, we are bound to find outliers even if they are representative of our population. They aren't necessarily problems but are actually part of the distribution. When this happens, we cannot remove the outliers because we would be trying to make the sample less variable than it is in reality. Even if having such outliers hurts model performance, we can't simply drop them for the sake of better metrics.

4. Drop based on magnitude

The decision to drop the anomalies is also affected by their magnitude, that is, how many of them there are. If there are only a few relative to the dataset size and domain experts confirm that they are non-representative of the majority, it is safe to exclude them for transparency. In contrast, if there are enough of them to raise suspicion, then they are definitely not rare occurrences. It is highly likely that some unknown process generated those datapoints, and they should be carefully explored. At this point, we can't simply drop them to increase model accuracy. Instead, more "outlier-friendly" options should be considered, like Generalized Linear Models (GLMs), quantile regression, or Generalized Estimating Equations. It can also happen that there are so many outliers in the data that they collectively form a cluster. In this case, they become a new sub-group, and their features and the reasons for their presence should be carefully analyzed.
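One way to make "only a few relative to the dataset size" concrete is to compute the share of flagged points. The sketch below is only illustrative: the flags are made up, `outlier_share` is a hypothetical helper, and the 5% cut-off is an arbitrary assumption that would need to be set with domain experts.

```python
import pandas as pd

def outlier_share(is_outlier: pd.Series) -> float:
    """Return the fraction of rows flagged as outliers."""
    return float(is_outlier.mean())

# Hypothetical flags from any detector: True marks an outlier.
flags = pd.Series([False] * 97 + [True] * 3)
share = outlier_share(flags)

# Hypothetical cut-off; the right value depends on the domain.
MAX_RARE_SHARE = 0.05
if share <= MAX_RARE_SHARE:
    decision = "few outliers: review with domain experts before dropping"
else:
    decision = "many outliers: investigate the process that generated them"
print(f"{share:.1%} flagged -> {decision}")
```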

5. Trimming

There are also some less aggressive methods of dealing with outliers when we don't want to drop them. The first one is trimming, in other words forcing a distribution to be within a certain range. For example, we can trim the Volume column of Google stocks using the first and 99th percentiles. We find the first and 99th percentiles using the quantile method of a pandas Series and pass them to the clip method. The method will replace any values beyond the percentiles with the percentiles themselves.
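The slide code is not part of this transcript, so the following is a minimal sketch of the trimming step described above. The `google` DataFrame and its values are hypothetical stand-ins for the real stock data, and the `Volume_clipped` column name is chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Google stock data; only Volume matters here.
google = pd.DataFrame(
    {"Volume": [0, 120, 150, 130, 140, 160, 135, 145, 155, 10_000]}
)

# First and 99th percentiles of the column.
lower = google["Volume"].quantile(0.01)
upper = google["Volume"].quantile(0.99)

# clip replaces values outside [lower, upper] with the bounds themselves.
google["Volume_clipped"] = google["Volume"].clip(lower, upper)
print(google)
```

No rows are removed: the extreme values are pulled in to the percentile bounds, so the dataset keeps its original length.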

6. Replacing

You can also replace outliers with hard-coded values. pandas has a handy replace method for such purposes. For example, calling replace on the Volume column with the values 0 and 100 will replace all days that had no stocks traded with 100.
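Since the slide code is not included here, this is a small sketch of the replacement described above, again using a hypothetical `google` DataFrame with made-up volumes.

```python
import pandas as pd

# Hypothetical stand-in for the Google stock data.
google = pd.DataFrame({"Volume": [0, 1200, 0, 1500, 1800]})

# Replace every day with zero traded volume with the hard-coded value 100.
google["Volume"] = google["Volume"].replace(0, 100)
print(google)
```

The replace method only touches exact matches, so all other volumes stay as they were.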

7. Let's practice!

Now, let's practice!
