1. Differentially private machine learning models
Welcome back!
2. Sharing data safely
Following a differentially private approach with a small value of epsilon, the "privacy budget", can allow companies with similar data to share information with each other safely. This includes machine learning models.
3. Differentially private machine learning models
Imagine a software-as-a-service company that has multiple online stores as partners.
When a new partner joins, it might take months before it can collect enough data to develop and apply machine learning products to its business.
With differential privacy, the company can encourage current partners to share their data with new partners that have similar clients, knowing that it will remain anonymized and safe.
4. Machine learning and privacy
In many applications of machine learning, the datasets contain sensitive information. By exploiting the output of machine learning algorithms, an adversary may be able to identify some individuals in the dataset.
5. Machine learning and privacy
Differential privacy ensures that attackers cannot confidently extract private information about individuals.
Differentially private machine learning models can still capture the underlying distributions of large datasets while guaranteeing privacy with respect to the individuals in the data.
6. Differentially private classification models
With the models module from the diffprivlib package, we can train and test machine learning models the same way we would with scikit-learn, while satisfying differential privacy.
Here we import a Gaussian naive Bayes classifier from both packages: first from scikit-learn and then from diffprivlib.
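As a minimal sketch, the two imports might look like this, with the diffprivlib class aliased as dp_GaussianNB to avoid a name clash:

```python
# Non-private Gaussian naive Bayes from scikit-learn
from sklearn.naive_bayes import GaussianNB

# Differentially private version from diffprivlib, aliased to avoid a name clash
from diffprivlib.models import GaussianNB as dp_GaussianNB
```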
7. Non-private classifier
To compare, let's first run a non-private naive Bayes model from scikit-learn on the Heart Failure dataset.
We create the model with the GaussianNB constructor, fit it to the training data, and finally obtain the prediction score on the testing data using the dot score method.
It gets an accuracy score of 83.33 percent.
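A minimal sketch of these steps, assuming hypothetical variables X (features) and y (labels) already hold the Heart Failure data:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumed: X and y are already loaded from the Heart Failure dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create, fit, and score the non-private classifier
nonprivate_clf = GaussianNB()
nonprivate_clf.fit(X_train, y_train)
print(nonprivate_clf.score(X_test, y_test))  # the lesson reports 0.8333
```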
8. Differentially private classifier
Now we will import the GaussianNB class as dp_GaussianNB and run the private classifier, initializing it with the empty constructor this time.
If we don't specify any parameters, the model defaults epsilon to 1 and infers the feature bounds from the data.
We fit it using the fit method, and run it to see the prediction score with dot score.
The accuracy is 70 percent.
The test accuracy will change every time it is run, due to the randomness introduced by differential privacy.
Fitting without explicit bounds also throws a warning when dot fit is called, as it leaks additional information: it reveals the true minimum and maximum values of the original data.
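Putting those steps together, a sketch of the private classifier with the default constructor, reusing the train/test split from before:

```python
from diffprivlib.models import GaussianNB as dp_GaussianNB

# Empty constructor: epsilon defaults to 1 and the feature bounds
# are inferred from the training data
dp_clf = dp_GaussianNB()
dp_clf.fit(X_train, y_train)   # emits a PrivacyLeakWarning about inferred bounds
print(dp_clf.score(X_test, y_test))  # around 0.70; varies between runs due to DP noise
```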
9. Avoid privacy leakage
To avoid this data leakage, we can supply the min and max values ourselves by passing a bounds argument. This argument is a tuple of the form (min, max), with single numbers covering the min and max of the entire data, or with arrays giving the min and max values of each column.
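For example, the bounds argument could take either of these two shapes (the values here are hypothetical):

```python
import numpy as np

# One scalar (min, max) pair covering every value in the data
scalar_bounds = (0, 600)

# Or per-column arrays: one min and one max for each feature
column_bounds = (np.array([0.0, 18.0, 50.0]), np.array([1.0, 95.0, 600.0]))
```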
10. Avoid privacy leakage
Here we take the minimum and maximum values of each feature in our dataset and round them outward by 1: for min(), we round down so the bounds sit below the true minimum values, and for max(), we round up past the true maximum values.
We set epsilon to 0.5 when creating the model, and pass our bounds as the argument for the bounds parameter.
We obtain 80.70 percent accuracy.
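A sketch of one way to build such bounds, reusing the earlier split; the exact rounding used in the lesson may differ slightly:

```python
import numpy as np
from diffprivlib.models import GaussianNB as dp_GaussianNB

# Per-column bounds, widened so they sit below the true minimums
# and above the true maximums (one possible rounding scheme)
lower = np.floor(np.asarray(X_train).min(axis=0)) - 1
upper = np.ceil(np.asarray(X_train).max(axis=0)) + 1

dp_clf = dp_GaussianNB(epsilon=0.5, bounds=(lower, upper))
dp_clf.fit(X_train, y_train)   # no privacy warning this time
print(dp_clf.score(X_test, y_test))  # the lesson reports about 0.807
```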
11. More on adding bounds
We can also randomize the bounds further. We need to import the random module. Here we draw a sample of random numbers from 0 to 30 to be added to or subtracted from the bounds, so it's harder to know the real min and max values of each of the 12 columns.
We create and run the model and obtain a score of 75.44 percent.
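A sketch of that idea, with hypothetical offsets drawn from 0 to 30 for each of the 12 feature columns:

```python
import random
import numpy as np
from diffprivlib.models import GaussianNB as dp_GaussianNB

# One random offset per column (12 features, as in the lesson)
offsets = np.array([random.randint(0, 30) for _ in range(12)])

# Subtract from the minimums and add to the maximums to mask the true extremes
lower = np.asarray(X_train).min(axis=0) - offsets
upper = np.asarray(X_train).max(axis=0) + offsets

dp_clf = dp_GaussianNB(epsilon=0.5, bounds=(lower, upper))
dp_clf.fit(X_train, y_train)
print(dp_clf.score(X_test, y_test))  # the lesson reports about 0.7544
```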
12. Different epsilon values
We can see the tradeoff between accuracy and privacy (epsilon) by comparing different epsilon values, from 0.01 to 10, and plotting their scores with matplotlib.
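A sketch of such a sweep, reusing the bounds from before and plotting accuracy against epsilon on a log scale:

```python
import numpy as np
import matplotlib.pyplot as plt
from diffprivlib.models import GaussianNB as dp_GaussianNB

# Score the private classifier across a log-spaced grid of epsilon values
epsilons = np.logspace(-2, 1, 30)   # from 0.01 up to 10
scores = []
for eps in epsilons:
    clf = dp_GaussianNB(epsilon=eps, bounds=(lower, upper))
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

plt.semilogx(epsilons, scores)
plt.xlabel("epsilon (privacy budget)")
plt.ylabel("test accuracy")
plt.title("Accuracy vs. privacy tradeoff")
plt.show()
```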
13. Let's create privacy preserving models!
It's your turn to create private models!