Reducing identification risk with generalization
In this exercise, you will apply generalization to the IBM HR Analytics Employee Attrition & Performance dataset.
More specifically, you will transform the variable monthly_income into a binary column. The threshold for the transformation is the mean of the salaries, rounded up to an integer. The new values are 0 for incomes less than or equal to the integer mean and 1 for incomes greater than it.
The dataset is loaded as a pandas DataFrame hr.
This exercise is part of the course Data Privacy and Anonymization in Python.
Exercise instructions
- Calculate the mean value of the monthly_income column using .mean() and round it to an integer. Save it as mean_income.
- Apply a lambda function to hr['monthly_income'] to generalize the incomes to be 0 for values less than or equal to mean_income, and 1 for those that are greater.
- Explore the first five rows of the resulting DataFrame hr.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Calculate the mean value of incomes
mean_income = ____
# Apply generalization by transforming to binary data
hr['monthly_income'] = ____
# See resulting DataFrame
print(____)
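
For reference, one possible way to fill in the blanks is sketched below. It is not the official solution. Because the real hr DataFrame is only available inside the course environment, the sketch builds a small stand-in with made-up values (the age column and all numbers are assumptions for illustration). It also assumes the built-in round() is an acceptable way to get the integer threshold; math.ceil would be the choice if strict rounding up is required.

```python
import pandas as pd

# Hypothetical stand-in for the course's hr DataFrame (values are invented)
hr = pd.DataFrame({
    "age": [41, 49, 37, 33],
    "monthly_income": [5993, 5130, 2090, 2910],
})

# Calculate the mean value of incomes and round it to an integer
# (assumption: round(); swap in math.ceil for strict rounding up)
mean_income = round(hr["monthly_income"].mean())

# Apply generalization by transforming to binary data:
# 0 if the income is at or below the mean, 1 if it is above
hr["monthly_income"] = hr["monthly_income"].apply(
    lambda x: 0 if x <= mean_income else 1
)

# See resulting DataFrame (first five rows)
print(hr.head())
```

After the transformation, the exact salaries are no longer present in the column. Each record only reveals whether the income was above the threshold, which is what makes individual employees harder to single out by income.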