1. Introduction to differential privacy
Hello. In this video, you will learn about differential privacy.
2. What is differential privacy (DP)?
Imagine we perform a survey where we ask people: Do you dye your hair?
In the case of Alex, she does. She says yes.
3. What is differential privacy (DP)?
Before sending the response to the database, the system adds noise: it flips a coin.
4. What is differential privacy (DP)?
If it is heads,
5. What is differential privacy (DP)?
it sends the real answer.
6. What is differential privacy (DP)?
If it is tails,
7. What is differential privacy (DP)?
it flips the coin a second time.
8. What is differential privacy (DP)?
Then it will send "no" if it is heads and "yes" if it is tails.
We still collect data, but because of the added noise, it is harder to identify Alex as a person with dyed hair: a recorded "yes" may come from the second coin toss rather than from her real answer, and even someone who does not dye their hair has a 1 in 4 chance of being recorded as "yes".
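To make the coin-flip procedure concrete, here is a minimal sketch that simulates it many times. The function name `randomized_response`, the seed, and the number of trials are illustrative choices, not part of the survey system described here.

```python
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Return the answer that is actually sent to the database."""
    if rng.random() < 0.5:
        # First flip is heads: send the real answer.
        return truth
    # First flip is tails: a second flip decides the answer.
    # Heads -> "no" (False), tails -> "yes" (True).
    return rng.random() >= 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable
trials = 100_000

# Share of "yes" reports for someone who does dye their hair (theory: 3 in 4).
yes_when_true = sum(randomized_response(True, rng) for _ in range(trials)) / trials
# Share of "yes" reports for someone who does not (theory: 1 in 4).
yes_when_false = sum(randomized_response(False, rng) for _ in range(trials)) / trials

print(f"P(report yes | dyes hair)     ~ {yes_when_true:.3f}")
print(f"P(report yes | does not dye)  ~ {yes_when_false:.3f}")
```

Because a "yes" can come from either kind of respondent, a single recorded answer never proves what Alex really said.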
9. What is differential privacy (DP)?
In essence, differential privacy is a mathematical definition of privacy, turned into a system that processes information to protect individuals.
10. Who uses differential privacy (DP)?
Apple is one of the biggest companies that use this approach.
11. Who uses differential privacy (DP)?
Which websites affect the phone's battery life? Which emojis are chosen most often? What new words are trending? Apple uses differential privacy to learn this without identifying individual users.
12. Global differential privacy
One type of differential privacy is global, where the noise is added when the data is queried.
A trusted data curator holds the raw data and protects user privacy from third parties who query the database. This approach is generally more accurate: all the analysis happens on raw data, and only a small amount of noise is added at the end of the process.
13. Local differential privacy
Another type is local, where there is no trusted party; each person adds noise to their own data before sharing it. As in the example of the coins, we send data that has already been injected with noise.
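Even though every individual answer is noisy, we can still recover the overall picture. The sketch below applies the coin-flip scheme locally to each person and then corrects for the known noise. The population size, the 30% true rate, and the helper name `noisy_answer` are made up for illustration.

```python
import random

def noisy_answer(truth: bool, rng: random.Random) -> bool:
    """Coin-flip scheme from the survey example, applied on the person's side."""
    if rng.random() < 0.5:
        return truth              # heads: send the real answer
    return rng.random() >= 0.5    # tails: a second flip picks the answer

rng = random.Random(7)
true_rate = 0.30  # hypothetical share of people who dye their hair
population = [rng.random() < true_rate for _ in range(200_000)]

# Each person adds noise before sharing; the collector only sees this list.
reported = [noisy_answer(truth, rng) for truth in population]
observed_rate = sum(reported) / len(reported)

# P(report yes) = 0.5 * true_rate + 0.25, so we invert that relation.
estimated_rate = 2 * (observed_rate - 0.25)

print(f"observed yes rate:   {observed_rate:.3f}")
print(f"estimated true rate: {estimated_rate:.3f}")
```

The estimate is close to the true rate for the group, while no single reported answer can be trusted on its own.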
14. Epsilon-differential privacy
Differentially private systems are assessed by a value represented by the Greek letter epsilon. It measures how private, and how noisy, a data release is.
Higher values of epsilon indicate more accurate, less private answers. Low-epsilon systems give highly random answers that don't let possible attackers learn much at all.
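To see this trade-off in numbers, here is a rough sketch of a noisy count query. It assumes the Laplace mechanism, a common way to add noise that this video does not name, and the data and the function name `private_count` are hypothetical.

```python
import numpy as np

def private_count(values, epsilon, rng):
    """Count the True entries and add Laplace noise with scale 1 / epsilon."""
    true_count = int(np.sum(values))
    sensitivity = 1.0  # one person joining or leaving changes a count by at most 1
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
answers = rng.random(1000) < 0.3  # hypothetical survey answers

spreads = {}
for epsilon in (0.1, 1.0, 10.0):
    # Release the same count many times to measure how much the noise varies.
    releases = np.array([private_count(answers, epsilon, rng) for _ in range(2000)])
    spreads[epsilon] = releases.std()
    print(f"epsilon = {epsilon}: spread of released counts ~ {spreads[epsilon]:.2f}")
```

The spread shrinks as epsilon grows, which is the "more accurate, less private" end of the scale.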
15. Epsilon is exponential
Epsilon is exponential: the guarantee scales with e raised to the power of epsilon. A system with epsilon = 1 is almost three times more private than one with epsilon = 2, and over 8000 times more private than one with epsilon = 10.
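The two figures can be checked with a few lines of arithmetic. The helper name `privacy_ratio` is only for illustration; it computes e raised to the difference between the two epsilon values.

```python
import math

def privacy_ratio(eps_low: float, eps_high: float) -> float:
    """Factor e^(eps_high - eps_low) separating two epsilon values."""
    return math.exp(eps_high - eps_low)

print(round(privacy_ratio(1, 2), 3))  # 2.718, "almost three times"
print(round(privacy_ratio(1, 10)))    # 8103, "over 8000 times"
```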
16. K-anonymity and differential privacy
While k-anonymity is widely used, in most cases it isn't sufficient. Differential privacy is newer, preferred by companies nowadays, and exactly quantifies privacy degradation.
17. Introduction to diffprivlib
We can apply differential privacy with the diffprivlib library from IBM.
One application is to explore data by generating private histograms, using code almost identical to what we can use to generate histograms with NumPy.
18. Histograms
We can find the distribution of data without applying differential privacy, using the histogram function from NumPy.
"counts" is an array of histogram values, and "bins" is the array of bin edges, with the data divided into ten equally spaced bins.
To get the histogram height proportions, we can normalize the values by dividing "counts" by their total sum.
Then we plot the resulting histogram using the bar function from Matplotlib. We pass the bins, except the last one, because "bins" holds the edges and is therefore one element longer than "counts". Then we pass the height proportions and the width of each bar, obtained by subtracting each bin edge from the next one.
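The steps just described could look roughly like this. The `ages` array is made-up sample data, and `align="edge"` is our own choice so that each bar starts at its left bin edge; the plotting is wrapped in a function so the calculation can be run without opening a window.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1000)  # hypothetical data to explore

# counts: histogram values; bins: the 11 edges of ten equally spaced bins.
counts, bins = np.histogram(ages)

# Normalize the counts so the bar heights are proportions that sum to 1.
proportions = counts / counts.sum()

# Width of each bar: each bin edge subtracted from the next one.
widths = bins[1:] - bins[:-1]

def plot_histogram(bins, proportions, widths):
    """Draw the bars with Matplotlib (imported here so the rest runs without it)."""
    import matplotlib.pyplot as plt
    # bins[:-1] drops the last edge so there is one left edge per bar.
    plt.bar(bins[:-1], proportions, width=widths, align="edge")
    plt.show()

# plot_histogram(bins, proportions, widths)  # uncomment to display the chart
```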
19. Histograms
Here we see the resulting histogram.
20. Private histogram
The histogram function from diffprivlib is a useful tool to visualize data in a differentially private way. We import the tools module from diffprivlib. The syntax is the same as with NumPy, with the addition of an epsilon parameter. Here we set it to 0.1, meaning that the result will be noisier but also more private. The default value is 1.
21. Private histogram
On the right, the resulting private histogram is slightly different from the original because of the added noise.
22. Let's practice!
Let's practice!