Handle outliers with winsorization
Given is a basetable with two variables: "sum_donations" and "donor_id". "sum_donations" can contain outliers when donors have donated exceptional amounts. Therefore, you want to winsorize this variable such that the 5% highest amounts are replaced by the 95th percentile value.
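To make the idea concrete, here is a minimal sketch of capping the top 5% of values at the 95th percentile, using NumPy only. The donation amounts are made up for illustration and are not the course data. Note that `np.percentile` interpolates between data points by default, so the cap it computes can differ slightly from the value a rank-based routine such as `scipy.stats.mstats.winsorize` would use.

```python
import numpy as np

# Hypothetical donation totals: 19 ordinary amounts and one exceptional one.
sum_donations = np.array([10.0 * i for i in range(1, 20)] + [5000.0])

# Value below which roughly 95% of the amounts fall (interpolated by default).
upper_cap = np.percentile(sum_donations, 95)

# Replace every amount above the cap by the cap itself; lower amounts stay unchanged.
capped = np.minimum(sum_donations, upper_cap)

print(sum_donations.max())  # the exceptional amount before capping
print(capped.max())         # equals upper_cap after capping
```

Only the upper tail is touched here, which matches the exercise: small donations are realistic and are left as they are.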
This exercise is part of the course “Intermediate Predictive Analytics in Python”.
Exercise instructions
- Print the minimum value of sum_donations and verify that it is at least 0. Then print the maximum value of sum_donations.
- Fill out the appropriate lower limit percentile. As all values higher than 0 are realistic and occur often, it is not necessary to replace values lower than the lower limit percentile value.
- Create a new variable "sum_donations_winsorized" that is a winsorized version of the "sum_donations" variable.
- Print the maximum value of sum_donations_winsorized.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from scipy.stats.mstats import winsorize
# Check minimum and maximum sum of donations
print(____["____"].____())
print(____["____"].____())
# Fill out the lower limit
lower_limit = ____
# Winsorize the variable sum_donations
basetable["sum_donations_winsorized"] = ____(____["____"], limits=[lower_limit, 0.05])
# Check maximum sum of donations after winsorization
print(____["____"].____())
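For orientation, the same steps can be tried on a small toy table. This is a sketch under stated assumptions, not the course solution: the `donor_id` values and amounts are invented, and the lower limit of 0 reflects one reading of the instruction that low values need not be replaced. The result of `winsorize` is a masked array, so it is converted with `np.asarray` before being stored; that conversion is a choice made here for safety, not something the exercise requires.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Hypothetical basetable: 20 donors, one of whom donated an exceptional amount.
basetable = pd.DataFrame({
    "donor_id": range(1, 21),
    "sum_donations": [10.0 * i for i in range(1, 20)] + [5000.0],
})

# Inspect the range before winsorizing.
print(basetable["sum_donations"].min())
print(basetable["sum_donations"].max())

# Assumption: leave the lower tail untouched, cap only the highest 5%.
lower_limit = 0

# limits=[lower, upper] gives the fraction of values to replace in each tail.
winsorized = winsorize(
    basetable["sum_donations"].to_numpy(), limits=[lower_limit, 0.05]
)
basetable["sum_donations_winsorized"] = np.asarray(winsorized)

# Inspect the maximum after winsorizing.
print(basetable["sum_donations_winsorized"].max())
```

With 20 rows and an upper limit of 0.05, exactly one value (the largest) is replaced by the next-largest observed amount, while the original column stays unchanged.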
Learn how to prepare and organize your data for predictive analytics.
Once you have derived variables from the raw data, it is time to clean the data and prepare it for modeling. In this chapter, we discuss the steps needed to make your data modeling-ready.
- Exercise 1: Creating dummies
- Exercise 2: Creating a dummy from a two-category variable
- Exercise 3: Creating dummies from a many-categories variable
- Exercise 4: Missing values
- Exercise 5: How to replace missing values
- Exercise 6: Creating a missing value dummy
- Exercise 7: Replace missing values with the median value
- Exercise 8: Replace missing values with a fixed value
- Exercise 9: Handling outliers
- Exercise 10: Influence of outliers on predictive models
- Exercise 11: Handle outliers with winsorization
- Exercise 12: Handle outliers with standard deviation
- Exercise 13: Transformations
- Exercise 14: Interactions
- Exercise 15: Square root transformation
- Exercise 16: Adding interactions to the basetable