
Anonymizing with data generalization

1. Anonymizing with data generalization

Hello again!

2. Data generalization

Data generalization is a technique that replaces a data value with a less precise one by applying operations such as binning, rounding, or categorizing into broader concepts. The purpose is to eliminate identifiers while retaining data utility for analysis. For example, we can replace all occupation values of "Dancer" with "Artist". Here is another example with addresses. It keeps the zip code while generating a synthetic street in the same zip code, using Faker.
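The occupation example can be sketched in pandas as follows. This is a minimal illustration, not the course's own code: the DataFrame and the column name `occupation` are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical data; the column name "occupation" is an assumption.
df = pd.DataFrame({"occupation": ["Dancer", "Engineer", "Dancer", "Teacher"]})

# Replace the precise value with a broader concept.
# Values that are not in the mapping are left unchanged.
df["occupation"] = df["occupation"].replace({"Dancer": "Artist"})

print(df["occupation"].tolist())
```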

3. Data generalization

You can use a value range in place of a numerical value. Here, we do it with Age. Those who are 34 years old will be placed in a range of 30 to 50. This is also known as binning. For SSN, the sensitive data is partially masked.
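One way to sketch both ideas is with `pd.cut` for the age ranges and string slicing for the SSN. The bin edges, labels, mask format, and sample values below are assumptions chosen for illustration; only the 30 to 50 range comes from the example above.

```python
import pandas as pd

# Hypothetical records; the SSNs are made-up placeholders.
df = pd.DataFrame({
    "age": [34, 27, 58],
    "ssn": ["123-45-6789", "987-65-4321", "555-12-3456"],
})

# Binning: the intervals are [0, 29], (29, 50], (50, 120],
# so whole-number ages 30 through 50 fall into the "30-50" range.
df["age_range"] = pd.cut(
    df["age"],
    bins=[0, 29, 50, 120],
    labels=["<30", "30-50", ">50"],
    include_lowest=True,
)

# Partial masking: keep only the last four characters of the SSN.
df["ssn_masked"] = "***-**-" + df["ssn"].str[-4:]

print(df[["age_range", "ssn_masked"]])
```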

4. Data aggregation

Data aggregation is like generalization, but the data is presented in a grouped format that results from a relationship between its attributes. Here we categorize dancers, writers, and singers as artists, and astronomers, computer scientists, and biologists as scientists.
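A possible sketch of this grouping maps each occupation to its broader category and then reports only the group sizes. The column names `occupation` and `field` and the mapping dictionary are hypothetical.

```python
import pandas as pd

# Hypothetical data; one person per occupation.
df = pd.DataFrame({
    "occupation": ["Dancer", "Writer", "Singer",
                   "Astronomer", "Computer scientist", "Biologist"],
})

# Map each specific occupation to its broader category.
groups = {
    "Dancer": "Artist", "Writer": "Artist", "Singer": "Artist",
    "Astronomer": "Scientist", "Computer scientist": "Scientist",
    "Biologist": "Scientist",
}
df["field"] = df["occupation"].map(groups)

# Present the data in grouped form: one count per category.
counts = df.groupby("field").size()
print(counts)
```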

5. Medical dataset

Here, we have a dataset containing the age and medical condition of people who work at a company.

6. Medical dataset

By exploring the age distribution with a histogram using the hist method, we see that most of the people are in their 40s. With the bins argument you can set how many bins to use to represent the whole histogram data. The more bins, the more detailed the histogram. Here we set it to 15. The default value is 10.
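To see what the bins argument changes without drawing a plot, the sketch below uses `numpy.histogram` as a stand-in for the plotting call. It computes the kind of per-bin counts a histogram displays. The ages are randomly generated placeholders, not the course dataset.

```python
import numpy as np
import pandas as pd

# Synthetic ages for illustration only (seeded for repeatability).
rng = np.random.default_rng(0)
ages = rng.normal(loc=42, scale=8, size=200).round().clip(18, 70).astype(int)
df = pd.DataFrame({"age": ages})

# Same data, two levels of detail: 10 bins (the default) and 15 bins.
counts_10, edges_10 = np.histogram(df["age"], bins=10)
counts_15, edges_15 = np.histogram(df["age"], bins=15)

print(len(counts_10), len(counts_15))

# The plotted version would be (requires matplotlib):
# df["age"].hist(bins=15)
```

More bins means narrower intervals, so each bar covers fewer ages and the shape of the distribution shows in finer detail.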

7. Generalization

You can apply generalization by transforming data into binary columns: apply a lambda function with an if-else condition to the column. Everyone who is 40 or older will be set to 0 or to the string ">=40", while younger people will be set to 1 or "<40".
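A minimal sketch of the binary transformation, assuming a column named `age`. The new column names `age_group` and `under_40` are hypothetical.

```python
import pandas as pd

# Hypothetical ages.
df = pd.DataFrame({"age": [23, 40, 39, 61]})

# String version: ">=40" for people aged 40 or older, "<40" otherwise.
df["age_group"] = df["age"].apply(lambda x: ">=40" if x >= 40 else "<40")

# Numeric version: 0 for people aged 40 or older, 1 otherwise.
df["under_40"] = df["age"].apply(lambda x: 0 if x >= 40 else 1)

print(df)
```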

8. Top and bottom coding

An interesting technique is top and bottom coding. We see that only a few people are younger than 25 or older than 55. These are outliers, and people among them are easier to identify. With top and bottom coding, we can apply bounds to reduce the risk of re-identification for them. It works best when there are very few observations within a category, especially at the tails of the distribution.

9. Top coding

Let's filter for those who are 55 or older. Only 6 people meet the condition. And if there's only one female in the Finance department who is 58 years old, we now know that she has Dysthymia. This shows how outliers are easier to identify.
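The filtering step might look like the sketch below. The rows and the column names `age`, `sex`, `department`, and `condition` are invented for illustration and do not reproduce the course dataset.

```python
import pandas as pd

# Hypothetical records.
df = pd.DataFrame({
    "age": [58, 44, 56, 41, 60, 38],
    "sex": ["F", "M", "M", "F", "M", "F"],
    "department": ["Finance", "Finance", "IT", "IT", "Sales", "Finance"],
    "condition": ["Dysthymia", "Asthma", "Diabetes",
                  "Migraine", "Asthma", "Migraine"],
})

# Keep only the people who are 55 or older.
older = df[df["age"] >= 55]

# Count how many of them share each department/sex combination.
# A group of size 1 points to exactly one person.
group_sizes = older.groupby(["department", "sex"]).size()

print(older)
print(group_sizes)
```

Any combination with a count of 1 identifies a single person, which is why the tails of the distribution carry the highest re-identification risk.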

10. Top code

Based on the histogram and filtering, we can top code everyone aged 55 or older. Using the loc method on the DataFrame, set the condition that every person who is 55 years or older will have the top-coding value of 55. Now all those subjects are recorded as 55 years old, so age alone no longer singles out anyone in this group.
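Top coding with `loc` can be sketched as follows, assuming a column named `age` and made-up values.

```python
import pandas as pd

# Hypothetical ages.
df = pd.DataFrame({"age": [58, 44, 56, 41, 60, 38]})

# Top coding: every age of 55 or more is recorded as 55.
df.loc[df["age"] >= 55, "age"] = 55

print(df["age"].tolist())
```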

11. Bottom coding

We can do the same for outliers on the other side of the histogram. This is referred to as bottom coding. Here is the histogram after the top coding we did previously. Let's now bottom code with 25.

12. Bottom code

Apply bottom coding the same way we did before with top coding. Specify that for all people younger than 25, their age should be recorded as 25. The resulting histogram has the outliers removed. This way we still keep the majority of the original data, and thus, its utility for analysis.
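A sketch of bottom coding with `loc`, again assuming a column named `age` and invented values. The last lines show `Series.clip` as an alternative that applies both bounds in one call. That alternative is not mentioned in the narration and is included only for comparison.

```python
import pandas as pd

# Hypothetical ages, including outliers at both tails.
raw_ages = [19, 23, 25, 41, 58, 60]
df = pd.DataFrame({"age": raw_ages})

# Bottom coding: every age below 25 is recorded as 25.
df.loc[df["age"] < 25, "age"] = 25

# Alternative: clip applies the lower and upper bound together.
both_coded = pd.Series(raw_ages).clip(lower=25, upper=55)

print(df["age"].tolist())
print(both_coded.tolist())
```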

13. Data generalization and privacy models

Generalization works best when combined with suppression and masking under a "privacy model" such as k-anonymity. Privacy models specify conditions that the dataset must satisfy to keep disclosure risk under control. We'll learn how to implement it later.

14. Time to practice!

Time to practice!