1
Introduction to Data Privacy
Get ready to apply anonymization techniques such as data suppression, masking, synthetic data generation, and generalization. In this chapter, you’ll learn how to distinguish between sensitive and non-sensitive personally identifiable information (PII), quasi-identifiers, and the basics of the GDPR. You'll also encounter real-life examples of what can go wrong if you don't follow these best practices.
2
More on Privacy-Preserving Techniques
Discover how to anonymize data by sampling from datasets following the probability distribution of the columns. You’ll then learn how to apply the k-anonymity privacy model to prevent linkage or re-identification attacks and use hierarchies to perform data generalization in categorical variables.
3
Differential Privacy
Learn about differential privacy, the model used by major technology companies such as Apple, Google, and Uber. In this chapter, you’ll explore data by generating private histograms and computing private averages in data. You’ll also create differentially private machine learning models that allow businesses to increase the utility of their data.
4
Anonymizing and Releasing Datasets
In this final chapter, you’ll learn how to apply dimensionality reduction methods such as principal component analysis (PCA) to anonymize large multi-column datasets. You’ll then use Faker to generate realistic and consistent datasets, and scikit-learn to create synthetic datasets that follow a normal distribution. Lastly, you’ll tie everything you learned in this course together as you combine multiple techniques to safely release datasets to the public.

Reducing identification risk with generalization

In this exercise, you will apply generalization to the IBM HR Analytics Employee Attrition & Performance dataset.

More specifically, you will transform the variable monthly_income into a binary column. The threshold for the transformation is the mean of the salaries, rounded up to an integer. The new values are 0 for incomes less than or equal to the integer mean, and 1 for incomes greater than it.

The dataset is loaded as a pandas DataFrame hr.

Calculate the mean value of the monthly_income column using .mean() and round it to an integer. Save it as mean_income.
Apply a lambda function to hr['monthly_income'] to generalize the incomes to be 0 for values less than or equal to the mean_income, and 1 for those that are greater.
Explore the first five rows of the resulting DataFrame hr.
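
The three steps above can be sketched as follows. This is a minimal illustration, not the course's reference solution: the small hr DataFrame built here is a made-up stand-in for the real dataset (which the exercise preloads), and the age column is included only for context. The sketch rounds the mean to the nearest integer with round(); if the threshold is meant to be strictly rounded up, math.ceil would be the substitute.

```python
import pandas as pd

# Hypothetical stand-in for the preloaded hr DataFrame (values are invented).
hr = pd.DataFrame({
    "age": [41, 49, 37, 33, 27, 32],
    "monthly_income": [2090, 3468, 5130, 6502, 9980, 4193],
})

# Step 1: mean of the column, rounded to the nearest integer.
# (Assumption: nearest-integer rounding; use math.ceil to round up instead.)
mean_income = int(round(hr["monthly_income"].mean()))

# Step 2: generalize each income to 0 (<= mean_income) or 1 (> mean_income).
hr["monthly_income"] = hr["monthly_income"].apply(
    lambda x: 0 if x <= mean_income else 1
)

# Step 3: inspect the first five rows of the resulting DataFrame.
print(hr.head())
```

After the transformation, the exact salaries are no longer recoverable from the column; only whether each employee falls at or below the threshold, or above it, remains.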