1. Introduction to K-anonymity
Hello.
2. Why is k-anonymity important?
In 1997, Latanya Sweeney conducted a re-identification experiment in which she linked the then governor of Massachusetts, William Weld, to his medical records using publicly accessible records.
3. Why is it important?
Governor Weld lived in Cambridge. According to the Cambridge voter list, six people shared his birth date, only three of them were men, and he was the only one in his ZIP code.
On the right we see a diagram showing how re-identification attacks work, by linking data with other information to trace an individual. This was the case of the governor, using medical data and a public voter list.
4. How to prevent this attack?
In this case, only one record had the demographic characteristics of the governor.
A possible solution to avoid this could be to delete all the demographic information. But this would leave the data useless for analysis.
What if there is a middle ground to make sure that the characteristics are no longer unique in the dataset?
5. Definition of k-anonymity
That leads to k-anonymity, a privacy model commonly applied in data sharing scenarios.
For k-anonymity to be achieved, each individual must share their combination of potentially identifying attributes with at least k - 1 other individuals in the dataset, so that every such combination covers at least k records. This is normally achieved by suppressing and generalizing quasi-identifiers, the attributes that are not identifying on their own but can be when combined.
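To make suppression and generalization concrete, here is a minimal sketch with pandas. The records, the column names (zip, condition, zip_general) and the choice of masking the last two ZIP digits are all assumptions made for illustration, not data from the experiment.

```python
import pandas as pd

# Hypothetical toy records; names and values are illustrative only.
df = pd.DataFrame({
    "zip": ["02138", "02139", "02141", "02142", "90210"],
    "condition": ["flu", "asthma", "flu", "diabetes", "flu"],
})

# Generalization: keep only the first three digits of the ZIP code.
df["zip_general"] = df["zip"].str[:3] + "**"

# Suppression: drop records whose generalized ZIP is still shared
# by fewer than k rows.
k = 2
group_sizes = df.groupby("zip_general")["zip_general"].transform("size")
released = df.loc[group_sizes >= k, ["zip_general", "condition"]]

print(released)
```

The four Cambridge-area records now share one generalized ZIP, while the single remaining record is suppressed because generalizing it was not enough to hide it in a group.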
6. Definition of k-anonymity
K-anonymity might be described as "hiding in the crowd": if each individual is part of a larger group, then any of the records in that group could correspond to that person.
7. K-anonymous dataset
A dataset is said to be k-anonymous if every combination of values for identifying or quasi-identifying columns in the dataset appears in at least k different records.
For example, the dataset on the left is 2-anonymous, while the one on the right isn't, because the last two rows contain ZIP codes that appear only once.
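The definition can be checked in a few lines. The sketch below uses two small made-up tables, not the ones from the slide, and a hypothetical helper named anonymity_level that returns the size of the smallest group of records sharing the same quasi-identifier values.

```python
import pandas as pd


def anonymity_level(df, quasi_identifiers):
    """Return the size of the smallest group of rows that share
    the same values in the quasi-identifier columns."""
    return int(df.groupby(quasi_identifiers).size().min())


# Hypothetical tables; values are illustrative only.
left = pd.DataFrame({
    "zip": ["021**", "021**", "902**", "902**"],
    "age": ["20-30", "20-30", "30-40", "30-40"],
})
right = pd.DataFrame({
    "zip": ["021**", "021**", "90210", "90211"],
    "age": ["20-30", "20-30", "30-40", "30-40"],
})

print(anonymity_level(left, ["zip", "age"]))   # 2
print(anonymity_level(right, ["zip", "age"]))  # 1
```

A dataset is k-anonymous for any k up to the value this helper returns, so the first table is 2-anonymous and the second is not.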
8. K-anonymity: Terminology
An important term is sensitive attribute: a value that is of interest to release or study, and that we assume the individual wants to keep private.
These can include salaries, diagnoses, etc.
K-anonymity will only anonymize quasi-identifiers, while leaving sensitive attributes intact.
9. Medical data dataset
Let's make a dataset k-anonymous. Here, we have a medical dataset that includes information about people in a company, such as their age, department, and health condition.
10. Privacy attributes
Privacy attributes can be identifying, such as an SSN; quasi-identifying, such as the age and department; or sensitive, like the medical condition.
11. Exploring unique combination of quasi-identifiers
Let's take a look at the counts of unique combinations of our quasi-identifying attributes: age and department.
We can calculate them by first calling the groupby method from pandas, then size, followed by reset_index. This last method turns the group labels back into regular columns and, through its name parameter, lets us name the new column that holds the count for each combination of age and department. We see a lot of unique combinations with a count of 1.
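As a rough sketch of that chain of calls, assume a small stand-in DataFrame. The column names Age, Department and Condition and all the values are hypothetical, since the real dataset is not shown here.

```python
import pandas as pd

# Hypothetical stand-in for the medical dataset.
df = pd.DataFrame({
    "Age": [25, 31, 31, 47, 52, 58],
    "Department": ["Sales", "Sales", "Sales", "IT", "IT", "IT"],
    "Condition": ["flu", "asthma", "flu", "diabetes", "flu", "asthma"],
})

# Count how many records share each (Age, Department) combination.
combinations = (
    df.groupby(["Age", "Department"])
      .size()
      .reset_index(name="counts")
)

print(combinations)
```

In this toy data, four of the five combinations have a count of 1, which is the situation we want to remove.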
12. Approach: generalization
To generalize the age attribute, we can turn the ages into intervals using the pandas cut function. The first parameter is the column data and the second parameter, bins, is the number of equal-width intervals we want. Let's choose 4.
The intervals were added to the column "Age_group".
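A minimal sketch of this step follows. The ages are made up, and the Age column name is an assumption chosen to match Age_group.

```python
import pandas as pd

# Hypothetical ages; the real dataset is not shown here.
df = pd.DataFrame({"Age": [22, 29, 35, 41, 48, 56, 63, 70]})

# Split the observed age range into 4 equal-width intervals.
df["Age_group"] = pd.cut(df["Age"], bins=4)

print(df)
```

Note that equal-width intervals do not guarantee that every interval holds at least k records, so the counts still need to be checked afterwards.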
13. Approach: generalization
Checking the unique combinations again with Age_group, we don't see any with a count below two.
14. Approach: generalization
We can confirm that there are no unique combinations in the dataset by filtering the counts column for values less than k, which here is two.
No rows appeared, meaning that the dataset is 2-anonymous.
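Put together, the check might look like the sketch below. The data and the Age and Department column names are hypothetical. One assumption worth flagging: because Age_group is a categorical column, we pass observed=True so that groupby only counts combinations that actually occur, instead of also listing empty ones with a count of 0.

```python
import pandas as pd

# Hypothetical stand-in for the medical dataset.
df = pd.DataFrame({
    "Age": [22, 29, 35, 41, 48, 56, 63, 70],
    "Department": ["Sales", "Sales", "IT", "IT", "HR", "HR", "Legal", "Legal"],
})
df["Age_group"] = pd.cut(df["Age"], bins=4)

k = 2

# Count records per (Age_group, Department) combination,
# keeping only combinations that are present in the data.
counts = (
    df.groupby(["Age_group", "Department"], observed=True)
      .size()
      .reset_index(name="counts")
)

# Any row here is a combination shared by fewer than k records.
violations = counts[counts["counts"] < k]

print(violations.empty)
```

An empty result means every combination of quasi-identifiers is shared by at least k records in this toy data.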
15. Let's k-anonymize!
Let's k-anonymize!