
Outlier-robust feature scaling

1. Outlier-robust feature scaling

At the end of the last video, we mentioned that KNN is sensitive to features with disproportionate scales relative to their importance. Why does this matter?

2. Euclidean distance

Since KNN uses the distances between instances as anomaly scores, the scores can get skewed when features with large scales are involved. To illustrate, let's use two points, A and B, in 3D space. Point A's coordinates range from 1 to 9, whereas Point B's range from 10 to 100. To find the distance between them, we use the common Euclidean distance metric, which is calculated as displayed, in numpy. Each coordinate of point A is subtracted from the corresponding coordinate of point B and the difference is squared. The squared differences are summed, and the square root of the sum is taken.
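As a minimal sketch of that calculation (the exact coordinates from the video aren't shown here, so these values are only illustrative):

    import numpy as np

    # Illustrative coordinates: A's fall in the 1-9 range, B's in the 10-100 range
    point_a = np.array([1, 5, 9])
    point_b = np.array([10, 55, 100])

    # Euclidean distance: coordinate-wise difference, squared, summed, square-rooted
    distance = np.sqrt(np.sum((point_a - point_b) ** 2))
    print(distance)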

3. Euclidean in SciPy

To perform these calculations more quickly, we can also use the euclidean function from the scipy-dot-spatial-dot-distance module. The distance is 91-point-4 - an undesirably large number considering the small magnitude of point A's coordinates.
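The same computation with SciPy, reusing the illustrative points from the sketch above:

    from scipy.spatial.distance import euclidean

    # euclidean() takes two 1-D arrays and returns the same result as the
    # manual numpy calculation (for these illustrative points, not the video's 91.4)
    distance = euclidean(point_a, point_b)
    print(distance)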

4. Standardization

To solve such problems, a number of feature scaling techniques have been developed. One of the most common is standardization. When a feature is standardized, its mean is subtracted from every instance, and the result is divided by the feature's standard deviation. This gives the feature a mean of zero and a standard deviation of one.
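As a quick illustration of the formula (toy values, not from the course data):

    import numpy as np

    feature = np.array([2.0, 4.0, 6.0, 8.0])

    # Standardize: subtract the mean, then divide by the standard deviation
    standardized = (feature - feature.mean()) / feature.std()

    print(standardized.mean())  # ~0.0
    print(standardized.std())   # 1.0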

5. StandardScaler

Standardization is implemented as the StandardScaler transformer in the sklearn-dot-preprocessing module. After initializing the transformer, we extract the feature and target arrays. Here, we will use the weight in kilograms column as the target. Then, we fit ss to the feature array. During the fit, StandardScaler learns the mean and the standard deviation of every feature.
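A minimal sketch of those steps; the males DataFrame below is a toy stand-in for the Ansur data, and the column names (including the weightkg target) are assumptions:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Toy stand-in for the Ansur males data; column names are assumed
    males = pd.DataFrame({
        "footlength": [24.9, 26.1, 27.3, 25.5],
        "stature": [170.0, 182.5, 188.0, 175.2],
        "weightkg": [65.3, 80.1, 92.4, 71.8],
    })

    # Extract the feature and target arrays
    X = males.drop("weightkg", axis=1)
    y = males["weightkg"]

    # Fit the scaler: it learns each feature's mean and standard deviation
    ss = StandardScaler()
    ss.fit(X)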

6. Transforming

To make the transformation, we use the transform method.
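Continuing the sketch above:

    # Apply the learned scaling; the result is a numpy array
    X_transformed = ss.transform(X)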

7. fit_transform

It is also possible to fit and transform simultaneously with fit_transform.
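In the same sketch, the two steps collapse into one call:

    # Fit and transform in a single step
    X_transformed = ss.fit_transform(X)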

8. QuantileTransformer

However, StandardScaler is prone to the same pitfall as z-scores - it uses the mean and standard deviation in its calculations, and both can be skewed when there are many outliers in the dataset. To get around this problem, we can use another transformer called QuantileTransformer. Instead of the mean and standard deviation, QuantileTransformer uses quantile information, which stays the same regardless of the magnitude of the outliers. Let's apply it to the Ansur males dataset. We import the transformer from the preprocessing module of sklearn. After separating the feature and target arrays, we use the transformer's fit_transform method on X. The result is a numpy array with the same shape as X, with each feature scaled individually.
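A sketch of those steps, reusing the toy males DataFrame from the StandardScaler sketch (the column names remain assumptions):

    from sklearn.preprocessing import QuantileTransformer

    X = males.drop("weightkg", axis=1)
    y = males["weightkg"]

    # n_quantiles is capped at the number of rows only because the toy data
    # is tiny; with the real dataset the default of 1000 would be fine
    qt = QuantileTransformer(n_quantiles=len(X))
    X_transformed = qt.fit_transform(X)  # numpy array, same shape as X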

9. Preserving column names

To preserve the column names of the males DataFrame after the transformation, we can use the following pandas trick. By selecting all rows and all columns with two colons inside the dot-loc accessor, we keep the column names and update every cell of the DataFrame simultaneously.
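A sketch of the trick, assuming the whole toy males DataFrame is being transformed:

    # .loc[:, :] selects every row and every column, so assigning to it
    # overwrites all cells while the index and column names stay intact
    males.loc[:, :] = qt.fit_transform(males)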

10. Uniform histogram

To visualize the effect of the transformation, let's plot a histogram of the foot length column. By default, QuantileTransformer maps the features to a uniform distribution, as seen in the plot.
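Plotting the sketch's transformed column with matplotlib (footlength is an assumed column name):

    import matplotlib.pyplot as plt

    # Histogram of the transformed foot length column
    males["footlength"].hist()
    plt.show()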

11. Normal histogram

However, in most cases, a normal distribution is preferred, so we set the output_distribution parameter of QuantileTransformer to normal and repeat the process. Great! The histogram now has a normal shape, and all features of males are on similar scales.
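The same sketch with the normal output distribution:

    # Switch the output distribution from the default 'uniform' to 'normal'
    qt = QuantileTransformer(n_quantiles=len(males), output_distribution="normal")
    males.loc[:, :] = qt.fit_transform(males)

    males["footlength"].hist()
    plt.show()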

12. Let's practice!

Let's practice!
