
Reducing identification risk with generalization

In this exercise, you will apply generalization to the IBM HR Analytics Employee Attrition & Performance dataset.

More specifically, you will transform the variable monthly_income into a binary column. The threshold for the transformation is the mean of the salaries, rounded up to an integer. Values less than or equal to this integer mean become 0, and values greater than it become 1.
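To make the rule concrete, here is a minimal sketch in plain Python, without pandas. The income values are made up for illustration and are not taken from the HR dataset, and round() is an assumption about how the threshold is obtained; if a strict "round up" is intended, math.ceil() would be the alternative.

```python
# Hypothetical sample incomes; not taken from the HR dataset.
incomes = [2000, 3000, 4000, 7000]

# The mean is 4000.0; round() turns it into the integer threshold 4000.
# (Assumption: round() is used; math.ceil() would always round up.)
threshold = round(sum(incomes) / len(incomes))

# 0 for values less than or equal to the threshold, 1 for values above it.
binary = [0 if income <= threshold else 1 for income in incomes]

print(threshold)  # 4000
print(binary)     # [0, 0, 0, 1]
```

Note that the income equal to the threshold (4000) maps to 0, because the comparison is "less than or equal to".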

The dataset is loaded as a pandas DataFrame hr.

This exercise is part of the course

Data Privacy and Anonymization in Python


Exercise instructions

  • Calculate the mean value of the monthly_income column using .mean() and round it to an integer. Save it as mean_income.
  • Apply a lambda function to hr['monthly_income'] to generalize the incomes: 0 for values less than or equal to mean_income, and 1 for values that are greater.
  • Explore the first five rows of the resulting DataFrame hr.
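The three steps above could look roughly like the following sketch. Since the course's hr DataFrame is not available here, a small stand-in DataFrame with invented values and an extra hypothetical age column is built first; only the column name monthly_income and the variable name mean_income come from the exercise. The use of round() is again an assumption about the intended rounding.

```python
import pandas as pd

# Stand-in for the course's hr DataFrame (invented values, hypothetical "age" column).
hr = pd.DataFrame({
    "age": [41, 49, 37, 33, 27],
    "monthly_income": [6000, 5100, 2100, 2900, 3500],
})

# Step 1: mean of the column, rounded to an integer.
# int() guards against round() returning a NumPy float in older NumPy versions.
mean_income = int(round(hr["monthly_income"].mean()))

# Step 2: generalize to binary with a lambda: 0 if <= mean_income, else 1.
hr["monthly_income"] = hr["monthly_income"].apply(
    lambda x: 0 if x <= mean_income else 1
)

# Step 3: inspect the first five rows.
print(hr.head())
```

After the transformation, the column no longer reveals exact salaries, only whether each one is above the rounded mean. This coarser value is what reduces the identification risk.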

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Calculate the mean value of incomes
mean_income = ____

# Apply generalization by transforming to binary data
hr['monthly_income'] = ____

# See resulting DataFrame
print(____)