Generalizing data using hierarchies

1. Generalizing data using hierarchies

Hi!

2. Data generalization

As we have seen before, we can generalize data by replacing it with a less precise value. For example, replacing all occupation values of "Dancer" with "Artist", "Archaeologist" with "Scientist" or "Allergist" with "Doctor".

3. Data generalization with hierarchies

This type of data generalization, which relies on an aggregation transformation, can be thought of as generalization with hierarchies.

4. Refugee status dataset

Here we have a dataset containing information about the refugee status of people in a city. The quasi-identifiers are nationality and gender. The refugee status and the year of the request are considered sensitive attributes interesting for analysis.

5. Exploring the dataset

Remember that we can explore the counts of unique combinations of quasi-identifiers in the data. Here we calculate them based on nationality and gender. Then we filter for combinations that appear fewer than k times, with k defined as 4. We see that the dataset isn't 4-anonymous and that there are even unique rows, as in the case of the only Russian male.
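As a rough sketch of this exploration step, the snippet below counts the quasi-identifier combinations and keeps those that appear fewer than k times. The toy rows and the column names "Nationality" and "Gender" are assumptions made for illustration, since the real dataset isn't shown here.

```python
import pandas as pd

# Hypothetical toy data standing in for the refugee status dataset.
df = pd.DataFrame({
    "Nationality": ["Syria"] * 4 + ["Venezuela"] * 4 + ["Russia"],
    "Gender": ["Female"] * 4 + ["Male"] * 4 + ["Male"],
})

k = 4

# Count how many rows share each (Nationality, Gender) combination.
combo_counts = (
    df.groupby(["Nationality", "Gender"])
      .size()
      .reset_index(name="count")
)

# Combinations that appear fewer than k times break k-anonymity.
violations = combo_counts[combo_counts["count"] < k]
print(violations)
```

In this toy data, only the single Russian male row falls below k, so the data is not 4-anonymous.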

6. Data generalization with hierarchies

We can generalize these types of discrete variables such as the nationality column by using hierarchies. Each nationality or origin will be replaced by the continent it belongs to. With value_counts we see the different nationalities and their frequencies in the data.
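A minimal sketch of that frequency check, assuming the column is named "Nationality" and using made-up rows:

```python
import pandas as pd

# Hypothetical values; the real column contents are not shown here.
df = pd.DataFrame({
    "Nationality": ["Syria"] * 4 + ["Venezuela"] * 3 + ["Russia"],
})

# value_counts returns each distinct value with its frequency,
# sorted from most to least frequent.
nationality_counts = df["Nationality"].value_counts()
print(nationality_counts)
```

Values with very low counts are the ones most likely to need generalizing.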

7. Data generalization with hierarchies

We can create hierarchies using dictionaries. The important part is to associate each country with its corresponding continent. We do this by using each continent as a key and its list of countries as the value.

8. Data generalization with hierarchies

But for easy mapping later on, when replacing the values in the dataset, it's better to associate each country with its continent, one by one. In other words, we create a dictionary where the countries are the keys. We iterate over the items of the hierarchies dictionary and, for each country in a continent's list of countries, we add the country as a key and assign the continent as its value. We see how each country now maps to its corresponding continent.
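The two dictionaries described above could look roughly like this. The variable names and the country lists are illustrative assumptions, not the ones used on the slides, and where a country is placed in the hierarchy is a modeling choice.

```python
# Continent-keyed hierarchy: each continent maps to a list of countries.
# The entries are illustrative placeholders.
hierarchies = {
    "Asia": ["Syria", "Afghanistan"],
    "Europe": ["Russia", "Ukraine"],
    "South America": ["Venezuela", "Colombia"],
}

# Invert it into a country-keyed dictionary for easy lookups.
country_to_continent = {}
for continent, countries in hierarchies.items():
    for country in countries:
        country_to_continent[country] = continent

print(country_to_continent)
```

Keeping the continent-keyed version as the one we edit by hand is convenient, since it is shorter to read and the country-keyed version can always be rebuilt from it.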

9. Data generalization with hierarchies

Now we can replace each country with its continent using the .map() method. We call .map() on the nationality column and pass the country-keyed hierarchy dictionary. We see the generalized nationalities in the new column, which we named "Nationality_generalized".
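Here is a small sketch of the mapping step. The "Nationality" column name, the rows, and the dictionary contents are assumptions for illustration; only "Nationality_generalized" comes from the lesson.

```python
import pandas as pd

# Hypothetical rows and a hypothetical country-keyed dictionary.
df = pd.DataFrame({
    "Nationality": ["Syria", "Russia", "Venezuela", "Syria"],
    "Gender": ["Female", "Male", "Male", "Female"],
})
country_to_continent = {
    "Syria": "Asia",
    "Russia": "Europe",
    "Venezuela": "South America",
}

# Series.map looks up each value in the dictionary.
df["Nationality_generalized"] = df["Nationality"].map(country_to_continent)

# Values missing from the dictionary become NaN, so it is worth checking.
unmapped = int(df["Nationality_generalized"].isna().sum())
print(df)
print("Unmapped rows:", unmapped)
```

Writing to a new column rather than overwriting the original makes it easy to compare the values before and after generalization.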

10. Data generalization with hierarchies

We can filter again to check whether every combination of values in the "Nationality_generalized" and "Gender" columns appears in at least k records. Because the result is empty, we confirm that no combination appears fewer than 4 times in the data. We can use this approach to protect individuals' privacy when releasing the dataset to the public or to third parties.
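The final check can be sketched as follows, again with made-up rows. Only the column names "Nationality_generalized" and "Gender" come from the lesson.

```python
import pandas as pd

# Hypothetical generalized data in which every combination has 4 rows.
df = pd.DataFrame({
    "Nationality_generalized": ["Asia"] * 4 + ["Europe"] * 4,
    "Gender": ["Female"] * 4 + ["Male"] * 4,
})

k = 4

# Size of each group of rows sharing the same quasi-identifier values.
group_sizes = df.groupby(["Nationality_generalized", "Gender"]).size()

# Groups smaller than k; an empty result means the data is k-anonymous
# with respect to these two quasi-identifiers.
violations = group_sizes[group_sizes < k]
is_k_anonymous = violations.empty
print(is_k_anonymous)
```

If some groups were still smaller than k, we could generalize further or suppress those rows.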

11. More on K-anonymity

K-anonymity is a very well-known privacy model that can work well for certain low-dimensional datasets. In these lessons we have applied what's known as a type of constraint k-anonymity. There are other, more complex approaches to the privacy model, such as the Mondrian Multidimensional approach, which can prevent more types of attacks.

12. How safe is K-anonymity?

The problem of finding an optimal strategy for k-anonymity is computationally hard (NP-hard in general). K-anonymity can also be susceptible to other attacks such as the homogeneity attack, where the sensitive attributes in a k-anonymous group lack diversity and leak sensitive information about someone. Here we see that since everyone in Bob's group has the same disease, we know for sure that Bob has that disease. Although it is beyond the scope of this course, a solution for this is L-diversity, another privacy model that's usually used together with k-anonymity.
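To make the homogeneity attack concrete, here is a rough sketch that flags groups whose sensitive attribute takes a single value. The columns "Age_range", "Zip_prefix" and "Disease" and all rows are hypothetical. Counting distinct values per group is only the simplest form of the diversity idea, not a full L-diversity implementation.

```python
import pandas as pd

# Hypothetical 4-anonymous data: two groups of four rows each.
df = pd.DataFrame({
    "Age_range": ["20-30"] * 4 + ["30-40"] * 4,
    "Zip_prefix": ["130**"] * 4 + ["148**"] * 4,
    "Disease": ["Flu", "Flu", "Flu", "Flu",
                "Flu", "Cancer", "Asthma", "Flu"],
})

# Number of distinct sensitive values inside each quasi-identifier group.
distinct_values = df.groupby(["Age_range", "Zip_prefix"])["Disease"].nunique()

# A group with one distinct value reveals the disease of every member.
homogeneous_groups = distinct_values[distinct_values < 2]
print(homogeneous_groups)
```

Here the first group is 4-anonymous yet still exposes its members, because knowing that someone belongs to it is enough to know their disease.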

13. Let's practice!

Let's practice!