1. Differentially private clustering models
Let's now learn to build differentially private k-means models.
2. Comparing models
Here, we can see clustering performed on 3000 records using a non-private k-means model on the left.
Then here are the resulting clusters with a private model. The training time is the same in both: adding differential privacy doesn't increase computation time.
However, the inertia of the private model's clusters is larger, with a difference of almost 800. Remember, the goal of k-means is to minimize inertia, since it is the sum of squared errors within each cluster.
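To make the inertia comparison concrete, here is a small sketch of how it can be computed; the helper function and the tiny dataset are illustrative, not from the course:

```python
import numpy as np

def inertia(X, centers, labels):
    """Sum of squared distances from each point to its assigned cluster center."""
    return float(((X - centers[labels]) ** 2).sum())

# Tiny illustrative example: two clusters with two points each
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
centers = np.array([[0.5, 0.0], [5.5, 5.0]])
labels = np.array([0, 0, 1, 1])
print(inertia(X, centers, labels))  # 1.0: each point is 0.5 from its center
```

A lower inertia means points sit closer to their cluster centers, which is why the private model's larger inertia indicates a looser clustering.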
3. Comparing the models
Here, we see the difference in the resulting clusters between the non-private and private models, indicated by the purple points.
Although some elements weren't segmented into the same group, most of the observations were, which means the private model preserves privacy while keeping most of the resulting clusters in common with the non-private one.
4. Building differentially private clustering models
As we saw earlier, we can create different types of differentially private models with the Diffprivlib library.
We import the k-means model.
Having the data loaded as "X", we perform clustering on it, using the private model.
By default, the value of the privacy parameter epsilon is 1. We keep it this time and set the number of clusters to 3 since, in this example, we already know that the dataset has three big centers of data points.
We can run the model with dot fit_predict and obtain the clusters.
5. Improving DP clustering models
We can improve the results of the private model by pre-processing data before clustering. For example, by applying feature scaling and dimensionality reduction methods like PCA.
This will reduce the model's inertia and produce more accurate segmentation groups, just as it does with sklearn models.
6. Improving DP clustering models
We import PCA from the decomposition module of Scikit-learn.
Initialize the PCA class with the constructor.
Call the fit_transform method, passing the dataset as an argument. It fits PCA to the data and then transforms it, projecting it onto a smaller set of principal components that capture most of the variance and discarding the rest.
Then we perform clustering with a private K-means model.
7. Improving DP clustering models
When we plot the resulting clusters, we can see that we get a smaller inertia value of 2836, and segmentations closer to the non-private model and the intended clusters.
8. Improving DP clustering models
Here we can see that the difference between the resulting clusters is smaller. Thus, the private model produces results more similar to those of the non-private model.
9. Elbow method
We can also use the elbow method to find the optimal number of clusters for private models, although the results will be noisier.
We have to select the value at the "elbow": the point after which the distortion or inertia starts decreasing more slowly or in a linear fashion.
Here we see the result of applying the elbow method with the private model. We can still consider 4 as a possible optimal number of clusters.
10. Epsilon
As with other differentially private models, decreasing epsilon increases privacy, resulting in more distorted output.
Here we set the privacy value epsilon to 0 point 2.
Similar to the previous video, we can also set the bounds as privacy parameters to avoid data leakage about the min and max values in data.
11. Epsilon
We see that the resulting clusters are shifted in comparison with the scikit-learn model without differential privacy noise added. We can also notice that the inertia is larger, with a difference of almost 2000.
Just like we saw before, more privacy normally comes at the cost of accuracy, especially on smaller datasets.
12. Let's practice!
Let's practice!