Handle outliers with winsorization

You are given a basetable with two variables: "sum_donations" and "donor_id". "sum_donations" can contain outliers when donors have donated exceptional amounts. Therefore, you want to winsorize this variable so that the 5% highest amounts are replaced by the 95th percentile value.
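To make the idea concrete, here is a minimal sketch of capping values at the 95th percentile using plain NumPy rather than the scipy function used in this exercise. The donation amounts are made up for illustration. Note that `np.percentile` interpolates between data points by default, so the cap it computes may differ from the value that `winsorize` would use.

```python
import numpy as np

# Made-up donation totals for illustration; the last value is an outlier.
sum_donations = np.array([10, 20, 25, 30, 35, 40, 45, 50, 55, 60,
                          65, 70, 75, 80, 85, 90, 95, 100, 120, 5000],
                         dtype=float)

# The 95th percentile acts as the cap (NumPy interpolates between values).
upper_cap = np.percentile(sum_donations, 95)

# Values above the cap are pulled down to it; all others are unchanged.
capped = np.minimum(sum_donations, upper_cap)

print(sum_donations.max(), capped.max())
```

Only the extreme amount changes; the remaining values keep their original size, so the bulk of the distribution is untouched.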

This is a part of the course

“Intermediate Predictive Analytics in Python”

Exercise instructions

  • Print the minimum value of sum_donations and verify that it is at least 0. Then print the maximum value of sum_donations.
  • Fill in the appropriate lower limit percentile. Because all values higher than 0 are realistic and occur often, it is not necessary to replace values at the low end of the distribution.
  • Create a new variable "sum_donations_winsorized" that is a winsorized version of the "sum_donations" variable.
  • Print the maximum value of sum_donations_winsorized.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

from scipy.stats.mstats import winsorize

# Check minimum and maximum sum of donations
print(____["____"].____())
print(____["____"].____())

# Fill out the lower limit
lower_limit = ____

# Winsorize the variable sum_donations
basetable["sum_donations_winsorized"] = ____(____["____"], limits=[lower_limit, 0.05])

# Check maximum sum of donations after winsorization
print(____["____"].____())
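For reference, the same steps can be sketched on a small, hypothetical basetable. The donor IDs and amounts below are invented and are not the course data, and the lower limit of 0 is an assumption based on the instruction that low values need not be replaced. With 20 rows and an upper limit of 0.05, only the single largest value should be replaced by the next largest one.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Hypothetical basetable with 20 donors; the last amount is an outlier.
basetable = pd.DataFrame({
    "donor_id": range(1, 21),
    "sum_donations": [10, 20, 25, 30, 35, 40, 45, 50, 55, 60,
                      65, 70, 75, 80, 85, 90, 95, 100, 120, 5000],
})

# Check minimum and maximum sum of donations
print(basetable["sum_donations"].min())
print(basetable["sum_donations"].max())

# Assumption: no values are replaced at the low end.
lower_limit = 0

# winsorize returns a masked array; np.asarray converts it to a plain array.
winsorized = winsorize(basetable["sum_donations"].to_numpy(),
                       limits=[lower_limit, 0.05])
basetable["sum_donations_winsorized"] = np.asarray(winsorized)

# Check maximum sum of donations after winsorization
print(basetable["sum_donations_winsorized"].max())
```

Converting the column with `to_numpy()` and the result with `np.asarray` is a design choice in this sketch to keep the types explicit; the original column is left unchanged so both versions can be compared.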

This exercise is part of the course

Intermediate Predictive Analytics in Python

Skill level: Intermediate

Learn how to prepare and organize your data for predictive analytics.

Once you have derived variables from the raw data, it is time to clean the data and prepare it for modeling. In this chapter, we discuss the steps needed to make your data ready for modeling.

  • Exercise 1: Creating dummies
  • Exercise 2: Creating a dummy from a two-category variable
  • Exercise 3: Creating dummies from a many-categories variable
  • Exercise 4: Missing values
  • Exercise 5: How to replace missing values
  • Exercise 6: Creating a missing value dummy
  • Exercise 7: Replace missing values with the median value
  • Exercise 8: Replace missing values with a fixed value
  • Exercise 9: Handling outliers
  • Exercise 10: Influence of outliers on predictive models
  • Exercise 11: Handle outliers with winsorization
  • Exercise 12: Handle outliers with standard deviation
  • Exercise 13: Transformations
  • Exercise 14: Interactions
  • Exercise 15: Square root transformation
  • Exercise 16: Adding interactions to the basetable
