Handling outliers
In the last exercise, you learned how visualizing outliers could come in handy in a machine learning interview. Another convenient way for handling outliers is by calculating the Z-score which gives a threshold for outliers approximately +/-3 standard deviations away from the mean.
In this exercise, you will use the scipy.stats
module to calculate the Z-score using the stats.zscore()
function and the mstats.winsorize()
function to replace outliers using a technique called Winsorizing.
Recall from the video lesson that those points above and/or below 1.5 times the IQR should be suspected as possible outliers. For the last step in this exercise, that value is 2120.
The relevant packages have been imported for you, and loan_data
's numeric and categorical columns have been subset and saved as numeric_cols
and categoric_cols
, respectively.
This exercise is part of the course
Practicing Machine Learning Interview Questions in Python
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Print: before dropping
print(numeric_cols.mean())
print(numeric_cols.median())
print(numeric_cols.max())
# Create index of rows to keep
idx = (np.____(stats.____(____)) < 3).all(axis=1)
# Concatenate numeric and categoric subsets
ld_out_drop = pd.concat([numeric_cols.loc[____], categoric_cols.loc[____]], axis=1)
# Print: after dropping
print(ld_out_drop.mean())
print(ld_out_drop.median())
print(ld_out_drop.max())