
Data outliers and scaling

1. Data outliers and scaling

In the last lesson, we discussed data distributions and transformations. In this video, we'll cover two additional preprocessing steps: finding and handling outliers, and knowing how and when to scale your data.

2. Outliers

Outliers are one or more observations that lie far from the rest of the observations in a given feature. In a histogram of a feature, outliers tend to show up in the tails, as you can see in this image.

3. Inter-quartile range (IQR)

The inter-quartile range, or IQR, is defined as the difference between the values at the 3rd and 1st quartiles, which sit at 75% and 25% of the data, respectively, with the median exactly between them at 50%. In general, points that fall more than 1.5 times the IQR above the 3rd quartile or below the 1st quartile should be suspected as possible outliers, which corresponds to the shaded regions seen here. Keep scale in mind, though: an individual point carries less weight in a large dataset than the same datapoint would in a smaller one, and a point that is only twice as large as your upper boundary is less concerning than one that is ten times as large.
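As a quick sketch of the IQR rule, here is how you might flag suspect points with NumPy (the values below are made up for illustration):

```python
import numpy as np

# Hypothetical feature values; one extreme point at 40.
values = np.array([2, 4, 5, 5, 6, 7, 8, 9, 10, 40], dtype=float)

# 1st and 3rd quartiles, at 25% and 75% of the data
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1  # inter-quartile range

# Fences at 1.5 * IQR beyond each quartile
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the fences are possible outliers
outliers = values[(values < lower) | (values > upper)]
print(outliers)  # only the extreme point at 40 falls outside the fences
```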

4. Line of best fit

Looking at a simple linear regression model of a dataset with and without outliers reveals just how influential the extreme points are for this particular data: the slope and intercept coefficients are vastly different between the two fits. A thorough investigation should be undertaken before deciding whether to remove them. It's entirely possible that these anomalies are crucial, for example when designing an ML model whose purpose is to detect exactly such anomalous behavior.
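You can see this sensitivity numerically by fitting a line with and without an injected outlier; this sketch uses NumPy's polyfit on made-up data rather than the dataset from the slide:

```python
import numpy as np

# Hypothetical data: a clean linear trend y = 2x + 1 plus small noise.
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=20)

# Same data, but with one extreme outlier injected at the last point.
y_out = y.copy()
y_out[-1] = 100.0

# Fit a degree-1 polynomial (a line) to each version.
slope, intercept = np.polyfit(x, y, 1)
slope_o, intercept_o = np.polyfit(x, y_out, 1)

# A single extreme point noticeably shifts both coefficients.
print(slope, slope_o)
```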

5. Outlier functions

Some of the functions you'll encounter in the exercises come from the seaborn module: the boxplot function, with our target variable Loan Status supplied to y, gives conditioned boxplots, and distplot gives a histogram with a KDE overlay. NumPy's abs function returns absolute values. From the scipy module, stats.zscore calculates z-scores, and mstats.winsorize is a handy function that, given a list of limits, replaces outliers; in this example, with the 5th and 95th percentile data values. Finally, NumPy's where function evaluates the condition given as its first argument and returns the values specified by the second argument where the condition is true, and by the third where it is false.
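A minimal sketch of the non-plotting helpers on made-up data (the seaborn boxplot and distplot calls are omitted since they only produce figures):

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

# Hypothetical feature: mostly small values plus one extreme point at 50.
data = np.concatenate([np.linspace(1, 5, 19), [50.0]])

# z-score: how many standard deviations each point sits from the mean;
# np.abs lets us flag points beyond a threshold in either tail.
z = stats.zscore(data)
flagged = np.abs(z) > 2  # only the extreme point exceeds the threshold

# winsorize: with limits [0.05, 0.05], the bottom and top 5% of values
# are replaced with the 5th and 95th percentile data values.
capped = mstats.winsorize(data, limits=[0.05, 0.05])

# np.where: keep values below 10, replace the rest with the median.
cleaned = np.where(data < 10, data, np.median(data))
```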

6. High vs low variance

This image shows two normal distributions with different variances; variance measures the average squared deviation from the mean in a distribution. In a machine learning framework, a high-variance feature will be chosen more often than a low-variance feature, making it seem more influential when it may not be. The solution to this problem is to scale your data when the dataset contains features whose ranges vary greatly.

7. Standardization vs normalization

Sometimes the terms for scaling, most notably normalization and standardization, are used interchangeably, so let's clarify their definitions to avoid any confusion. Standardizing your data, also known as z-score scaling, takes each value minus the mean and divides by the standard deviation, giving the feature a mean of zero and a variance of one. Normalization, also seen as min-max scaling, takes each value minus the minimum and divides by the range, which scales the feature to lie between zero and one. Both approaches scale the data; they just do so differently.
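The two definitions above can be written directly in NumPy; the feature values here are made up:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # hypothetical feature

# Standardization (z-score): subtract the mean, divide by the
# standard deviation. The result has mean 0 and variance 1.
standardized = (x - x.mean()) / x.std()

# Min-max normalization: subtract the minimum, divide by the range.
# The result lies between 0 and 1.
normalized = (x - x.min()) / (x.max() - x.min())
```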

8. Scaling functions

In the exercises, you'll use two functions from scikit-learn's preprocessing module: StandardScaler standardizes to mean 0 and standard deviation 1, while MinMaxScaler normalizes the data to lie between 0 and 1.
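A short sketch of both scalers on a hypothetical two-feature dataset whose columns have very different ranges:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical dataset: column 2 has a much larger range than column 1.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 900.0]])

# Each column is scaled independently.
X_std = StandardScaler().fit_transform(X)  # per column: mean 0, sd 1
X_mm = MinMaxScaler().fit_transform(X)     # per column: range [0, 1]
```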

9. Outliers and scaling

Here is another multiple choice question before heading over to the exercises. How should outliers be identified and properly dealt with? What result does min/max or z-score standardization have on data? Select the statement that is true. If the answer is not immediately apparent, pause this video to read through the possible answers and give yourself a moment to think about it. If you still aren't sure, consider re-watching this video lesson up to this point and pay particular attention to the definition of outliers, when outliers are helpful, and what each type of scaling does to the data before revealing the answer in the next slide.

10. Outliers and scaling: answer

The correct answer is that, in certain contexts where the goal is to find fraud or cybersecurity events, for example, data anomalies are required in order to create a predictive ML model to detect them in the future.

11. Outliers and scaling: incorrect answers

These are the reasons why the other answers are incorrect; make sure you understand them.

12. One last thing...

To put everything we've covered so far into better perspective, these are the preprocessing steps and the order in which they should be followed.

13. Let's practice!

Now, it's time for some practice.