PCA for anonymization
1. PCA for anonymization
Welcome back!
2. Principal component analysis (PCA)
As you have seen previously, principal component analysis, or PCA, is a method often used to reduce the dimensionality of large datasets.
3. Data masking with PCA
In chapter 1 we learned about data masking. PCA can also be used to mask data. Imagine a dataset of beers. Using PCA, we can construct new principal components that summarize our beers dataset well. For example, a new feature might be computed as 2 times alcoholic volume minus beer bitterness level, or some other similar combination.
4. Data masking with PCA
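As a rough sketch of the beer example, here is how such a combined feature could be computed with NumPy. The measurements and variable names are made up for illustration. The weights 2 and -1 come from the example above; a real PCA would centre the data and learn its own weights.

```python
import numpy as np

# Hypothetical beer measurements (made-up values, for illustration only).
alcohol = np.array([4.5, 5.0, 6.2, 7.8])          # alcoholic volume
bitterness = np.array([20.0, 35.0, 45.0, 60.0])   # bitterness level

# Stack the original features as columns: one row per beer.
features = np.column_stack([alcohol, bitterness])

# Example weights from the text: 2 * alcohol - 1 * bitterness.
# (Assumption: a fitted PCA would choose different, unit-length weights.)
weights = np.array([2.0, -1.0])

# Each beer is now described by a single combined value.
new_feature = features @ weights
print(new_feature)
```

Someone who only sees new_feature cannot read off the alcohol or bitterness of a beer directly.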
Here we see how PCA projects data in different ways by creating these features. Each dot shows one particular beer. The properties on the x and y axes are correlated. A new property can be constructed by drawing a line through the center of this beer cloud and projecting all points onto this line. We can see what these projections look like for different lines. The red dots are the projections of the blue dots.
5. Data masking with PCA
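The projection onto a line through the center of the cloud can be sketched in a few lines of NumPy. The data here is synthetic and the helper project_onto_line is a hypothetical name used only for this illustration. PCA picks, as its first component, the direction along which the projected points have the largest variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated, made-up properties (stand-ins for the beer measurements).
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.3, size=200)
points = np.column_stack([x, y])

# Move the center of the cloud to the origin.
centered = points - points.mean(axis=0)

def project_onto_line(centered_points, angle):
    """Project points onto the line through the origin at the given angle."""
    direction = np.array([np.cos(angle), np.sin(angle)])  # unit vector
    coords = centered_points @ direction        # position along the line
    projected = np.outer(coords, direction)     # 2-D location of each projection
    return coords, projected

# Try a few lines and record how spread out the projections are.
variances = {}
for angle_deg in (0, 45, 90):
    coords, projected = project_onto_line(centered, np.deg2rad(angle_deg))
    variances[angle_deg] = coords.var()
    print(angle_deg, variances[angle_deg])
```

For this synthetic cloud, the 45 degree line spreads the projections out more than the horizontal or vertical line does.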
PCA without the dimensionality reduction is only a rotation of the original feature space. Each principal component is a linear combination of the existing features, so when we keep all of them, the information in the data is preserved, which is an advantage for predictive tasks. We will mask data this way.
6. Data masking with PCA
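To check the rotation idea, here is a minimal sketch on synthetic data, assuming scikit-learn's PCA with all components kept. If the transform is only a centering plus a rotation, the distances between every pair of rows should stay the same even though the values themselves change.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic data: 50 rows, 4 features with different scales (illustration only).
data = rng.normal(size=(50, 4)) * np.array([1.0, 5.0, 0.5, 2.0])

# Keep as many components as there are features: no dimensionality reduction.
pca = PCA(n_components=data.shape[1])
rotated = pca.fit_transform(data)

def pairwise_distances(a):
    """Euclidean distance between every pair of rows."""
    diffs = a[:, None, :] - a[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

# Distances are preserved, but the individual values are different.
same_distances = np.allclose(pairwise_distances(data), pairwise_distances(rotated))
values_differ = not np.allclose(data, rotated)
print(same_distances, values_differ)
```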
If the resulting numeric values are released without an explanation of how they were calculated, machine learning models can still be trained on them and make accurate predictions, while adversaries would not know how to interpret the masked values.
7. Data masking with PCA
Let's apply data masking with PCA on the Heart Disease UCI dataset. It contains 14 columns related to personal medical information. The target column refers to the presence of heart disease in a patient.
8. Data masking with PCA and Scikit-learn
We prepare the data by separating out the target column, in this case the label indicating heart disease. x_data is the data without the target column; we use the drop method to remove that column. "y" holds the target values, obtained by accessing the values property of the target column.
9. Data masking with PCA and Scikit-learn
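The preparation step just described might look like the sketch below. The DataFrame here is a tiny made-up stand-in, not the real Heart Disease UCI data, and the name heart and the three feature columns are assumptions chosen for illustration.

```python
import pandas as pd

# Tiny made-up stand-in for the heart disease data (not real patient records).
heart = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "thalach": [150, 187, 172, 178],
    "target": [1, 1, 0, 0],
})

# Features: everything except the target column.
x_data = heart.drop("target", axis=1)

# Labels: the target column as a NumPy array.
y = heart["target"].values

print(x_data.columns.tolist())
print(y)
```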
Import the PCA class from the decomposition module of scikit-learn. Then initialize it, setting the number of components to the number of columns in x_data through the n_components parameter. We calculate the number of columns with the function "len". Apply PCA with fit_transform, passing x_data as an argument. We only apply PCA to x_data and not to the target column that we aim to predict later.
10. Data masking with PCA and Scikit-learn
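Here is a minimal sketch of the PCA step described above. Since the real dataset is not loaded here, x_data is a synthetic DataFrame with hypothetical column names; the PCA calls themselves follow the description.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for x_data (assumption: 100 rows, 5 numeric features).
x_data = pd.DataFrame(
    rng.normal(size=(100, 5)),
    columns=[f"feature_{i}" for i in range(5)],
)

# Keep as many components as there are columns, so nothing is discarded.
pca = PCA(n_components=len(x_data.columns))

# Fit PCA on the features only and transform them in one step.
x_data_pca = pca.fit_transform(x_data)

print(type(x_data_pca), x_data_pca.shape)
```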
The resulting transformed data will be a NumPy array.
11. Data masking with PCA and Scikit-learn
We create a DataFrame with the constructor from pandas. There are 13 principal components, as we specified in the n_components parameter.
12. Data utility after PCA data masking
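Wrapping the transformed array in a DataFrame could be sketched like this. The array below is a small stand-in for the PCA output, and the column names PC1, PC2, and so on are an assumption made for readability; the DataFrame constructor also works without them.

```python
import numpy as np
import pandas as pd

# Stand-in for the array returned by fit_transform (4 rows, 3 components).
x_data_pca = np.arange(12, dtype=float).reshape(4, 3)

# Hypothetical generic names, so the columns reveal nothing about the originals.
column_names = [f"PC{i + 1}" for i in range(x_data_pca.shape[1])]

df_x_data_pca = pd.DataFrame(x_data_pca, columns=column_names)
print(df_x_data_pca.head())
```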
Let's see how much data utility there is after applying data masking with PCA. For that, we will perform classification with logistic regression on both the original and resulting data to see if there is an accuracy loss. Logistic regression is a classification algorithm used to predict a binary outcome.
13. Data utility after PCA data masking
Here we split the resulting dataset into training and test data using train_test_split from scikit-learn. It splits arrays or matrices into random train and test subsets. Create the logistic regression model and specify a maximum number of iterations of 200 with max_iter. Then fit the model on the training data. Lastly, evaluate the model on the test data and obtain its accuracy with the score method.
14. Data utility before PCA data masking
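The classification on the PCA-masked data, as walked through above, might be sketched as follows. The data is synthetic, and the test_size and random_state values are assumptions added so the example is reproducible; max_iter=200 comes from the description.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic features and a binary label that depends on two of them.
x_data = rng.normal(size=(300, 6))
noise = rng.normal(scale=0.5, size=300)
y = (x_data[:, 0] + 0.5 * x_data[:, 1] + noise > 0).astype(int)

# Mask the features with a full-rank PCA.
x_data_pca = PCA(n_components=x_data.shape[1]).fit_transform(x_data)

# Split the masked data into random train and test subsets.
x_train, x_test, y_train, y_test = train_test_split(
    x_data_pca, y, test_size=0.25, random_state=0
)

# Fit logistic regression and measure accuracy on the held-out rows.
model = LogisticRegression(max_iter=200)
model.fit(x_train, y_train)
accuracy = model.score(x_test, y_test)
print(accuracy)
```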
Let's do exactly the same with the original data, x_data, converted to a NumPy array so it can be split into train and test data. We obtain the same accuracy as with the PCA-transformed values, meaning that there was no accuracy loss.
15. Let's practice!
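To see the two results side by side, here is a sketch on synthetic data. The helper holdout_accuracy is a hypothetical name for this illustration, and reusing the same random_state is an assumption that makes both runs use the same rows for testing. Because a full-rank PCA only centers and rotates the features, we would expect the two accuracies to match or be very close.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def holdout_accuracy(features, labels):
    """Train logistic regression and return accuracy on a held-out split."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.25, random_state=0
    )
    model = LogisticRegression(max_iter=200)
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)

rng = np.random.default_rng(7)

# Synthetic features and binary labels (illustration only).
x_data = rng.normal(size=(300, 6))
noise = rng.normal(scale=0.5, size=300)
y = (x_data[:, 0] - x_data[:, 2] + noise > 0).astype(int)

# Masked version of the same features, all components kept.
x_data_pca = PCA(n_components=x_data.shape[1]).fit_transform(x_data)

accuracy_original = holdout_accuracy(x_data, y)
accuracy_masked = holdout_accuracy(x_data_pca, y)
print(accuracy_original, accuracy_masked)
```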
Let's practice!