
What's private, and why do we care?

1. What's private, and why do we care?

Hello, and welcome to the course!

2. Facebook and Cambridge Analytica scandal

Many of you have probably heard about the Facebook and Cambridge Analytica scandal, in which Facebook allowed unauthorized access to the personal information of 87 million people. That information was used to build psychological profiles of American voters and influence their voting behavior. This understandably led to a massive public relations crisis, as well as massive fines.

3. What's privacy?

As you might imagine, the core of this scandal is the concept of privacy. Privacy is about how information flows. Think about your location. When you walk on the street, you are sharing your face and location with people around you.

4. Information flow and privacy

Then why might facial recognition or location tracking software make you feel uncomfortable?

5. Information flow and privacy

What bothers us is how personal information flows: whether it flows in a way that we didn't expect or are afraid of. We can thus define privacy as the ability to ensure flows of information that satisfy social and legal norms.

6. Personally identifiable information (PII)

A key concept in privacy is that of personally identifiable information or PII. It is data that, when used alone or with other relevant data, can identify someone.

7. Sensitive PII

There are two types of PII. Sensitive PII is data that is clearly about a particular person and whose loss or unauthorized disclosure could result in harm, embarrassment, or inconvenience.

8. Sensitive PII

Examples include full names, Social Security numbers, financial information, and medical records. Sensitive PII must be inaccessible to outside parties unless permission is granted. If it is not protected, regulations such as the General Data Protection Regulation, or GDPR, allow fines of up to €20 million or 4% of a company's annual turnover, whichever is higher.

9. Non-sensitive PII

Non-sensitive PII is data that cannot be used alone to trace a person, such as gender, occupation, zip code, or city of birth.

10. Non-sensitive PII

However, these attributes are quasi-identifiers: combined with other PII, they can still be used to identify someone.
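To make the idea of quasi-identifiers concrete, here is a minimal sketch, assuming a small made-up table. The `people` DataFrame, its column names, and all of its values are hypothetical. It checks which rows become unique once several non-sensitive attributes are combined.

```python
import pandas as pd

# Hypothetical records; every value here is invented for illustration.
people = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F"],
    "occupation": ["nurse", "nurse", "teacher", "engineer", "engineer"],
    "zip_code": ["10001", "10001", "10001", "94105", "94105"],
})

quasi_identifiers = ["gender", "occupation", "zip_code"]

# True for rows whose combination of values is shared with at least one other row.
is_shared = people.duplicated(subset=quasi_identifiers, keep=False)

# Rows with a one-of-a-kind combination are the easiest to single out.
unique_rows = people[~is_shared]
print(unique_rows)
```

In this toy table no one is unique by gender alone, yet three of the five rows are unique once gender, occupation, and zip code are taken together.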

11. GDPR: EU General Data Protection Regulation

Both sensitive and non-sensitive PII are protected under GDPR, which covers the PII of people who live in the European Union or whose data is processed there. Its core goal is to give people control over how their data is collected and used. The GDPR's key principles are that PII is processed lawfully, used only for specified purposes, restricted to what is necessary, kept accurate, and stored for a limited time. For more information, you can visit the link shown on the slide.

12. Data suppression

There are a variety of ways that we can protect PII. A basic anonymization technique is suppression. This refers to removing selected information to protect the privacy of subjects. It can be attribute suppression, where columns are entirely removed. It can also be cell or record suppression, removing records with sensitive or unique attribute values that can disclose protected information.

13. Attribute suppression on a dataset

Here we perform attribute suppression on the names in the 2020 White House Salaries dataset. We use the drop method from pandas, passing the column to be removed as the first argument and the axis as the second argument, in this case columns. This technique should be applied when an attribute is not required in the anonymized dataset.
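The call described here might look like the following minimal sketch. The `salaries` DataFrame is a hypothetical stand-in for the real dataset: its column names and values are assumptions made for illustration, not the actual White House data.

```python
import pandas as pd

# Hypothetical stand-in for the salaries dataset; names and figures are invented.
salaries = pd.DataFrame({
    "Name": ["Alex Doe", "Sam Roe", "Kim Poe"],
    "Position": ["Advisor", "Analyst", "Director"],
    "Salary": [120000, 85000, 150000],
})

# Attribute suppression: remove the whole "Name" column.
# drop returns a new DataFrame and leaves the original untouched.
suppressed = salaries.drop("Name", axis="columns")
print(suppressed.columns.tolist())
```

Because drop returns a copy by default, the original data stays available for auditing, while only `suppressed` would be shared.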

14. Record suppression on a dataset

For record suppression, a common procedure is to identify outliers and remove them. Here there is a clear outlier: a salary of over two million dollars.

15. Record suppression on a dataset

Again we use drop, this time passing the index of the rows that meet a condition: here, all rows with a salary greater than 2000000.
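As a minimal sketch of this step, assuming a hypothetical `salaries` DataFrame with a `Salary` column (the rows below are invented, including the outlier), the filtering and drop could be written as:

```python
import pandas as pd

# Hypothetical data with one obvious outlier; all values are invented.
salaries = pd.DataFrame({
    "Position": ["Advisor", "Analyst", "Director", "Consultant"],
    "Salary": [120000, 85000, 150000, 2500000],
})

# Index labels of the rows that meet the condition.
outlier_index = salaries[salaries["Salary"] > 2_000_000].index

# Record suppression: drop those rows (axis defaults to rows).
trimmed = salaries.drop(outlier_index)
print(trimmed)
```

Passing index labels rather than a boolean mask is what drop expects, which is why the condition is first used to select rows and then `.index` is taken.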

16. Suppression and linkage attacks

Suppression can fail against linkage, or re-identification, attacks, in which your dataset is linked to other public data sources. Here, even after names are suppressed, we can link the voter registration data with the medical records and discover Alice's disease. We will see approaches to deal with this later.
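A linkage attack of this kind can be sketched with a simple join. Both tables below are hypothetical: the column names, the choice of quasi-identifiers, and every value are assumptions made for illustration, not the data shown on the slide.

```python
import pandas as pd

# Hypothetical medical records with names already suppressed.
medical = pd.DataFrame({
    "zip_code": ["02138", "02139", "02141"],
    "birth_year": [1985, 1990, 1978],
    "gender": ["F", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "flu"],
})

# Hypothetical public voter registration list that still contains names.
voters = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "zip_code": ["02138", "02139", "02142"],
    "birth_year": [1985, 1990, 1978],
    "gender": ["F", "M", "F"],
})

# Join the two sources on the quasi-identifiers they share.
linked = voters.merge(medical, on=["zip_code", "birth_year", "gender"], how="inner")
print(linked[["name", "diagnosis"]])
```

In this toy data, Alice and Bob each match exactly one medical record, so their diagnoses are exposed even though the medical table never contained a name.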

17. Let's practice!

Let's practice!