1. Data masking and data generation with Faker
Welcome back!
2. Quasi-identifiers
In the last video, you learned about sensitive and non-sensitive PII. Non-sensitive personally identifiable information alone doesn't directly identify someone, but it can uniquely identify someone if combined with other personal information.
3. Quasi-identifiers
These are known as quasi-identifiers. When combined, they can become personally identifying information.
Quasi-identifiers can include age, gender, occupation, and personal dates.
4. Quasi-identifiers and re-identification attacks
Combining quasi-identifiers with information from other data sources, with the intent to discover a subject's identity, is known as a re-identification attack.
Here we see how two datasets can be linked. According to a study of the 1990 census data, 87% of the US population can be uniquely identified by gender, ZIP code, and full date of birth.
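To make the linking step concrete, here is a minimal sketch with pandas. Both DataFrames, their column names, and every value in them are hypothetical and invented for illustration; the only point is that an inner join on the shared quasi-identifiers can reattach names to records that had their names removed.

```python
import pandas as pd

# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
medical = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "zip_code": ["02138", "02139", "02138"],
    "birth_date": ["1945-07-31", "1962-03-14", "1980-11-02"],
    "diagnosis": ["hypertension", "asthma", "diabetes"],
})

# Hypothetical public list: names stored next to the same quasi-identifiers.
voters = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones", "Carol White"],
    "gender": ["F", "M", "F"],
    "zip_code": ["02138", "02139", "02140"],
    "birth_date": ["1945-07-31", "1962-03-14", "1980-11-02"],
})

# Link the two datasets on the quasi-identifiers they share.
quasi_identifiers = ["gender", "zip_code", "birth_date"]
linked = medical.merge(voters, on=quasi_identifiers, how="inner")

# Each row that survives the join is a re-identified subject.
print(linked[["name", "diagnosis"]])
```

In this toy data, two of the three records match on all three quasi-identifiers, so two subjects are re-identified even though the first dataset contains no names.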
5. More anonymization techniques
For this reason, it's important to extend anonymization techniques to all the quasi-identifiers and ensure that we reduce data disclosure risks. Remember that one way to protect privacy in a dataset is to apply suppression.
6. Data masking
When we suppress sensitive data, we fully remove it, either by deleting it or by replacing it entirely with other characters.
Replacing sensitive values with other characters, like "x" or even random characters, is also known as data masking or pseudonymization. Different types of data transformation functions can be used.
7. Data masking
Here we have a dataset containing users' card numbers, countries, and emails. We will mask the card numbers.
8. Data masking
We can fully mask a column of the DataFrame in a uniform way by replacing every value with a fixed string, like repeated asterisk characters. Here, we mask card_number with four asterisks.
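A minimal sketch of this full masking, assuming a pandas DataFrame named df with a card_number column. The sample rows are hypothetical and only stand in for the dataset described in the video.

```python
import pandas as pd

# Hypothetical sample data standing in for the users dataset.
df = pd.DataFrame({
    "card_number": ["4539578763621486", "5500005555555559", "340000000000009"],
    "country": ["Spain", "France", "Brazil"],
    "email": ["jane.doe@example.com", "pierre@example.org", "ana.silva@example.net"],
})

# Assigning a single string overwrites every value in the column.
df["card_number"] = "****"

print(df.head())
```

Because the same string replaces every value, the masked column no longer distinguishes one user from another.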
9. Partial data masking
A type of generalization is partial masking, for instance, masking the username part of emails. Here we can see how Facebook does this when trying to reset a password.
10. Partial data masking
To mask the usernames of the emails in our DataFrame, we can apply a lambda function to the email column.
If you haven't yet encountered lambda functions, they are short anonymous functions that can be passed as parameters.
Remember, strings can be indexed and sliced like arrays. So for each email string s, we first take the first character of the email, then append the asterisk characters we wish to use for masking. We then use the find method to locate the @ symbol, slice the string from that position onwards, and append the result after the asterisks.
Now the emails no longer disclose the full usernames.
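The steps just described could be written as follows. This is a sketch under assumptions: the DataFrame name df, the email column name, the sample addresses, and the choice of four asterisks are all illustrative.

```python
import pandas as pd

# Hypothetical sample emails.
df = pd.DataFrame({
    "email": ["jane.doe@example.com", "pierre@example.org", "ana.silva@example.net"],
})

# For each email string s:
#   s[0]            keeps the first character,
#   "****"          is the mask,
#   s[s.find("@"):] keeps everything from the @ symbol onwards.
df["email"] = df["email"].apply(lambda s: s[0] + "****" + s[s.find("@"):])

print(df["email"].tolist())
# ['j****@example.com', 'p****@example.org', 'a****@example.net']
```

This sketch assumes every value contains an @ symbol. If it doesn't, find returns -1 and the slice keeps only the last character, so real data may need a check first.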
11. Generating synthetic data
Another way to mask data is to generalize it and replace it with other information that resembles the original.
For example, we can generate fake card numbers that have the same pattern and characteristics as the original ones.
12. Generating synthetic data
To generate new data, you can use the Faker package.
To begin, import the Faker class from the faker package. Then, use the Faker constructor to create and initialize a generator, which can generate data with methods named after the type of data you want.
To get a new card number, use the credit_card_number method. Every time it's run, it will generate a new real-looking number.
13. Generating synthetic data
Replace the original values in the card_number column by applying a lambda function.
Here we pass a lambda function to the apply method of the card_number Series. The lambda function can be read as: for each element x of the card_number column, create a new number.
Printing the head, we see the newly pseudonymized DataFrame, with a column containing fake credit card numbers.
14. Generating other types of data
You can also generate and pseudonymize other data, like names, using name(). You can also generate names by gender, using name_male() for male names and name_female() for female names.
15. Time to practice!
Let's practice!