Anonymizing categorical data

1. Anonymizing categorical data

Hello. In this video, we will learn more about how to efficiently generalize data, in particular categorical data.

2. Generalization

In the previous chapter, you learned about data generalization and how to apply it by transforming data into, for example, binary attributes, such as splitting people into those older than 40 and those younger than 40.

3. Generalization of categorical data

But what about non-numerical data? For example, the education field of people in a dataset. Here, we see the IBM Human Resources And Analytics Employee dataset. Multiple columns are categorical, such as business travel, department, and education field.

4. Categorical data

A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values. Each value can be thought of as a group. Examples of categorical variables are race, gender, hometown, age group or range, educational level, and the types of movies a person likes.

5. Anonymizing categorical data

We can replace the PII data with different values sampled from the same discrete probability distribution as the original data, which maintains data utility for analysis.

6. Anonymizing categorical data

The original dataset is on the left. On the right is the resulting dataset after sampling from the probability distribution of the educationField column. While the distribution is similar, the values are not necessarily the same for each individual.

7. Sample from data

Many organizations, such as the U.S. Census, publicly release samples of the data they collect about citizens. These datasets are sanitized, meaning anonymized or pseudo-anonymized. Then a tiny sample is released so that others can calculate large-scale statistical patterns, for example averages, variances, and clusters, while small-scale information, such as the salary of a particular person, stays obfuscated.

8. Explore the distribution

Let's explore the distribution of the different education fields in this dataset, using the value_counts() method. This returns the count of each unique entry in a column. We see that Life Sciences is the most frequent field, followed by Medical, and so on.
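As a minimal sketch of this step, the snippet below counts the unique values in a column. The small DataFrame and the column name EducationField are hypothetical stand-ins for the HR dataset, which is not loaded here.

```python
import pandas as pd

# Hypothetical stand-in for the education field column of the HR dataset.
df = pd.DataFrame({
    "EducationField": [
        "Life Sciences", "Medical", "Life Sciences", "Marketing",
        "Medical", "Life Sciences", "Technical Degree", "Life Sciences",
    ]
})

# Count how often each unique value appears, most frequent first.
counts = df["EducationField"].value_counts()
print(counts)
```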

9. Explore the distribution

We can also display the distribution of a column as a bar plot by calling the plot method after value_counts() and setting the kind parameter to bar.

10. Explore the distribution

From the value counts, we can obtain a sequence with the indexes of the unique values. The indexes, in this case, are the names of the education fields themselves. Based on these we will calculate the probability distributions.

11. Explore the distribution

We can obtain the relative frequencies of each unique value in the education field column by setting the normalize parameter of the value_counts() method to True. These are the probabilities with which a random variable can take on each education field. For example, there's a 41% probability that if you picked a random row, that row would have an education field of "Life Sciences".
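A short sketch of the relative frequencies, assuming a made-up Series of ten education fields rather than the real column (so the numbers differ from the 41% in the video):

```python
import pandas as pd

# Hypothetical column: 4 Life Sciences, 3 Medical, 2 Marketing, 1 Other.
fields = pd.Series(
    ["Life Sciences"] * 4 + ["Medical"] * 3 + ["Marketing"] * 2 + ["Other"]
)

# normalize=True returns proportions instead of raw counts.
probs = fields.value_counts(normalize=True)
print(probs)
```

The proportions sum to 1, so they can be read as a discrete probability distribution over the categories.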

12. Explore the distribution

We can select only the values by accessing the values property of the pandas Series returned by the value_counts() method.
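To make the two pieces concrete, here is a sketch that separates the category names (the index) from the probabilities (the values). The data is again a hypothetical stand-in for the education field column.

```python
import pandas as pd

# Hypothetical column: 4 Life Sciences, 3 Medical, 2 Marketing, 1 Other.
fields = pd.Series(
    ["Life Sciences"] * 4 + ["Medical"] * 3 + ["Marketing"] * 2 + ["Other"]
)
probs = fields.value_counts(normalize=True)

# The index holds the category names; .values holds the matching
# probabilities as a NumPy array, in the same order.
categories = probs.index
probabilities = probs.values

print(list(categories))
print(probabilities)
```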

13. Sampling from the same distribution

We can use the random choice function from numpy to randomly sample from our distribution of education fields. We pass in the indexes as the sequence to sample from, which in this case are the category names. Then we pass the relative frequencies to the p parameter as the probabilities associated with each category. Finally, we specify the number of samples to generate with the size parameter, in this case the number of rows in the dataset.
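The sampling step might look like the sketch below. The DataFrame, the column name EducationField, and the fixed seed are assumptions for illustration. The seed only makes the example repeatable and would normally be left out when anonymizing real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the HR dataset.
df = pd.DataFrame({
    "EducationField":
        ["Life Sciences"] * 4 + ["Medical"] * 3 + ["Marketing"] * 2 + ["Other"]
})

# Relative frequency of each category in the original column.
probs = df["EducationField"].value_counts(normalize=True)

np.random.seed(42)  # assumption: seeded only so the sketch is repeatable

# Draw one new value per row from the same discrete distribution.
df_anon = df.copy()
df_anon["EducationField"] = np.random.choice(
    probs.index,       # category names to sample from
    p=probs.values,    # probability of each category
    size=len(df),      # as many samples as there are rows
)
print(df_anon)
```

Each row now holds a randomly drawn field, so the link between an individual and their original value is broken while the overall distribution is approximately preserved.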

14. Sampling from the same distribution

With value_counts(), we can compare the frequencies of the sampled dataset, shown here on the right, with the original ones on the left. They differ very little from the originals, so the distribution is maintained.
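One way to check this side by side is sketched below, using 1,000 hypothetical rows so that the sampled frequencies have room to settle near the originals. The data and the seed are assumptions, not the video's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical original column with 1,000 rows.
original = pd.Series(
    ["Life Sciences"] * 400 + ["Medical"] * 300
    + ["Marketing"] * 200 + ["Other"] * 100
)
probs = original.value_counts(normalize=True)

np.random.seed(0)  # assumption: seeded only so the sketch is repeatable
sampled = pd.Series(
    np.random.choice(probs.index, p=probs.values, size=len(original))
)

# Align both sets of relative frequencies by category name.
comparison = pd.DataFrame({
    "original": probs,
    "sampled": sampled.value_counts(normalize=True),
}).fillna(0)
print(comparison)
```

With small datasets the sampled frequencies can drift further from the originals, since each draw is random.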

15. Let's practice!

Now it's your turn. Let's practice.
