Handling outliers

In the last exercise, you learned how visualizing outliers can come in handy in a machine learning interview. Another convenient way to handle outliers is to calculate the Z-score, which measures how many standard deviations a point lies from the mean. Points more than about 3 standard deviations above or below the mean are commonly treated as outliers.
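As a minimal sketch of the Z-score rule, assuming a small made-up array (the name values and its contents are illustrative, not the course data), the score can be computed by hand and compared with stats.zscore():

```python
import numpy as np
from scipy import stats

# Toy data (hypothetical): twenty typical values plus one extreme value.
values = np.array([10.0] * 10 + [12.0] * 10 + [100.0])

# Z-score: distance from the mean in units of standard deviation.
z_manual = (values - values.mean()) / values.std()  # population std (ddof=0)
z_scipy = stats.zscore(values)                      # also uses ddof=0 by default

# Flag points whose absolute z-score is 3 or more.
is_outlier = np.abs(z_scipy) >= 3
print(values[is_outlier])
```

Note that in very small samples a single point cannot reach a large Z-score, so the threshold of 3 only becomes meaningful with a reasonable number of observations.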

In this exercise, you will use the scipy.stats module: the stats.zscore() function to calculate the Z-score, and the mstats.winsorize() function to replace outliers using a technique called Winsorizing, which replaces the most extreme values with the nearest value inside chosen percentile limits.
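To see what Winsorizing does, here is a small sketch on a made-up array of 20 values (the array is an assumption for illustration only). With 5% limits on each side, one value at each end is replaced by its nearest neighbour in sorted order:

```python
import numpy as np
from scipy.stats import mstats

# Toy data (hypothetical): the integers 1..19 plus one extreme value.
values = np.arange(1, 21, dtype=float)
values[-1] = 1000.0

# limits=[0.05, 0.05] clips the lowest 5% and the highest 5% of the points.
# winsorize returns a masked array, so convert it back to a plain ndarray.
wins = np.asarray(mstats.winsorize(values, limits=[0.05, 0.05]))

print("before:", values.min(), values.max(), values.mean())
print("after: ", wins.min(), wins.max(), wins.mean())
```

Unlike dropping rows, this keeps the number of observations unchanged, and the median is unaffected because only the tails are modified.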

Recall from the video lesson that points lying more than 1.5 times the IQR above the third quartile or below the first quartile should be suspected as possible outliers. For the last step in this exercise, that threshold is 2120.
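The IQR rule can be sketched as follows, assuming a small made-up array; the numbers are illustrative and do not reproduce the 2120 threshold, which comes from the course data:

```python
import numpy as np

# Toy data (hypothetical), not the course's Monthly Debt column.
values = np.array([1200.0, 1350.0, 1500.0, 1650.0, 1800.0, 5000.0])

# First and third quartiles, and the interquartile range between them.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points outside these fences are suspected outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
suspects = values[(values < lower_fence) | (values > upper_fence)]

print(lower_fence, upper_fence, suspects)
```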

The relevant packages have been imported for you, and loan_data's numeric and categorical columns have been subset and saved as numeric_cols and categoric_cols, respectively.


1. Create an index of the rows to keep, where the absolute z-scores on the numeric columns are less than 3, and use it to index the numeric and categorical subsets before concatenating them.

2. Winsorize 'Monthly Debt' with 5% upper and lower limits and print the mean, median and max before and after.

3. Find the median of the values of 'Monthly Debt' that are lower than 2120 and replace the outliers with it.
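The three steps above can be sketched as follows. This is one possible approach, not the exercise's reference solution: the loan_data frame here is a randomly generated stand-in (its values, the 'Annual Income' and 'Purpose' columns, and the planted outlier are all assumptions), and only the names loan_data, numeric_cols, categoric_cols, 'Monthly Debt' and the 2120 threshold come from the exercise text.

```python
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import mstats

# Hypothetical stand-in for loan_data; the real dataset is not shown here.
rng = np.random.default_rng(0)
n = 200
loan_data = pd.DataFrame({
    "Monthly Debt": rng.normal(1000, 300, n).clip(min=0),
    "Annual Income": rng.normal(70000, 15000, n),
    "Purpose": rng.choice(["Home", "Car", "Other"], n),
})
loan_data.loc[0, "Monthly Debt"] = 9000.0  # planted outlier for illustration

numeric_cols = loan_data.select_dtypes(include=[np.number])
categoric_cols = loan_data.select_dtypes(exclude=[np.number])

# Step 1: keep rows whose absolute z-score is below 3 in every numeric column.
z = np.abs(stats.zscore(numeric_cols.to_numpy()))
idx = (z < 3).all(axis=1)
ld_out_drop = pd.concat([numeric_cols.loc[idx], categoric_cols.loc[idx]], axis=1)

# Step 2: winsorize 'Monthly Debt' with 5% limits on each side.
debt = loan_data["Monthly Debt"]
debt_win = pd.Series(
    np.asarray(mstats.winsorize(debt.to_numpy(), limits=[0.05, 0.05])),
    index=debt.index,
)
print("before:", debt.mean(), debt.median(), debt.max())
print("after: ", debt_win.mean(), debt_win.median(), debt_win.max())

# Step 3: replace values above the threshold with the median of the rest.
threshold = 2120  # value given in the exercise text
median = debt[debt < threshold].median()
debt_replaced = pd.Series(np.where(debt > threshold, median, debt), index=debt.index)
print("max after replacement:", debt_replaced.max())
```

The three techniques trade off differently: dropping rows shrinks the dataset, Winsorizing keeps every row but flattens both tails, and median replacement changes only the values beyond the chosen threshold.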
