Sampling from the best continuous distribution
Random sampling from a well-fitting probability distribution helps maintain privacy. At the same time, it allows authorized parties to conduct an accurate statistical analysis of the data.
In this exercise, you will anonymize the column monthly_income from the IBM dataset. In the previous lesson, you determined the exponnorm continuous distribution to be the best fit. Use it to model the incomes.
The dataset is available as hr.
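The trade-off described above, new individual values but similar aggregate statistics, can be illustrated with a small sketch. It does not use the IBM data: the series named original is a hypothetical stand-in generated from an exponnorm distribution with made-up parameters, and the seeds are arbitrary.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for hr["monthly_income"]; the real IBM data is not used here.
original = pd.Series(
    stats.exponnorm.rvs(2.0, loc=3000, scale=1000, size=2000, random_state=42),
    name="monthly_income",
)

# Fit the distribution, then draw a fresh sample of the same length.
K, loc, scale = stats.exponnorm.fit(original)
sampled = pd.Series(
    stats.exponnorm.rvs(K, loc=loc, scale=scale, size=len(original), random_state=7),
    name="monthly_income",
)

# Summary statistics should be similar even though every value is a new draw.
comparison = pd.DataFrame(
    {"original": original.describe(), "sampled": sampled.describe()}
)
print(comparison.round(1))
```

If the fit is reasonable, the two columns of the printed table should be close, while no row of sampled corresponds to a specific row of original.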
This exercise is part of the course
Data Privacy and Anonymization in Python
Exercise instructions
- Import the stats module from the scipy package.
- Fit the exponnorm distribution to the continuous variable monthly_income to obtain the parameters of the distribution and later generate the samples.
- Sample from the exponnorm distribution and replace monthly_income using the .rvs() method. Specify the size to be the same as the length of the column.
- Round the salaries to their closest integer.
Interactive exercise
Try this exercise by completing this sample code.
# Import stats from scipy
____
# Fit the exponnorm distribution to the continuous variable monthly income
params = ____
# Sample from the exponnorm distribution and replace monthly income
hr['monthly_income'] = ____
# Round the salaries to their closest integer
hr['monthly_income'] = ____
# See the resulting dataset
print(hr.head())
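One possible way to fill in the blanks above is sketched here. It is a minimal example, not the official solution. Because the hr dataset is not included on this page, the sketch builds a small hypothetical hr DataFrame with synthetic incomes. In the exercise environment, hr is already loaded and that step would be skipped.

```python
import numpy as np
import pandas as pd
# Import stats from scipy
from scipy import stats

# Hypothetical stand-in for the hr DataFrame (assumption: the real one is preloaded)
hr = pd.DataFrame(
    {
        "monthly_income": stats.exponnorm.rvs(
            2.0, loc=3000, scale=1000, size=500, random_state=0
        )
    }
)

# Fit the exponnorm distribution to the continuous variable monthly income
# params is a tuple of (K, loc, scale)
params = stats.exponnorm.fit(hr["monthly_income"])

# Sample from the exponnorm distribution and replace monthly income
hr["monthly_income"] = stats.exponnorm.rvs(*params, size=len(hr["monthly_income"]))

# Round the salaries to their closest integer
hr["monthly_income"] = hr["monthly_income"].round()

# See the resulting dataset
print(hr.head())
```

Unpacking params with * passes the fitted shape, location, and scale to .rvs() in the order that .fit() returns them. No random_state is set for the sampling step, so the printed values differ from run to run.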