Build a differentially private classifier

In this exercise, you will create and train a private Gaussian Naive Bayes model on the Penguin dataset to predict whether a penguin is male or female.

k-anonymity performs poorly on high-dimensional or highly heterogeneous datasets because of a major theoretical and empirical limitation known as the "curse of dimensionality": as the number of variables or dimensions grows, the amount of data needed to generalize properly grows exponentially. This is one of the reasons differential privacy is now the preferred privacy model. Its guarantee, controlled by the parameter epsilon, does not depend on any prior knowledge an attacker may hold, and it bounds how much sensitive information can be learned about any individual.
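As a rough illustration of the curse of dimensionality described above, the following sketch uses made-up random records (not the Penguin data) and measures the k that the raw data would satisfy, that is, the size of the smallest group of records sharing the same quasi-identifier values, as more columns are treated as quasi-identifiers. The record count, the number of columns, and the helper name smallest_group are assumptions chosen for the example.

```python
import random
from collections import Counter


def smallest_group(records, n_columns):
    """Return the size of the smallest group of records that share
    the same values in their first n_columns columns."""
    groups = Counter(tuple(r[:n_columns]) for r in records)
    return min(groups.values())


random.seed(0)
# 1,000 hypothetical records, each with 8 quasi-identifiers taking 4 possible values
records = [[random.randrange(4) for _ in range(8)] for _ in range(1000)]

# k satisfied by the raw data when the first d columns are quasi-identifiers
k_by_dim = {d: smallest_group(records, d) for d in range(1, 9)}
for d, k in k_by_dim.items():
    print(f"{d} quasi-identifiers -> k = {k}")
```

Each added column can only split existing groups, so k never increases. With eight columns the 1,000 records are spread over 4^8 = 65,536 possible combinations, so many records end up alone in their group unless the data is generalized heavily.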

The DataFrame is loaded as penguin_df and split into X_train, y_train, X_test, and y_test. The private model class has been imported as dp_GaussianNB.

This exercise is part of the course

<cours>Data Privacy and Anonymization in Python</cours>

Exercise instructions

Create a dp_GaussianNB classifier without parameters.
Fit the model you just created to the training data, passing no additional parameters.
Compute the score of the private model on the test data.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Build the private classifier without parameters
dp_clf = ____

# Fit the model to the data
____(X_train, y_train)

# Print the accuracy score
print("The accuracy with default settings is ", ____(X_test, y_test))


Get ready to apply anonymization techniques such as data suppression, masking, synthetic data generation, and generalization. In this chapter, you’ll learn how to distinguish between sensitive and non-sensitive personally identifiable information (PII), quasi-identifiers, and the basics of the GDPR. You'll also encounter real-life examples of what can go wrong if you don't follow these best practices.

Exercise 1: What's private, and why do we care? Exercise 2: Privacy is power Exercise 3: Is it sensitive or non-sensitive? Exercise 4: Suppression of sensitive attributes Exercise 5: Data masking and data generation with Faker Exercise 6: Masking sensitive PII Exercise 7: Removing names with faker Exercise 8: Anonymizing with data generalization Exercise 9: Reducing identification risk with generalization Exercise 10: Data aggregation and data generalization Exercise 11: Top and bottom coding White House salaries

Discover how to anonymize data by sampling from datasets following the probability distribution of the columns. You’ll then learn how to apply the k-anonymity privacy model to prevent linkage or re-identification attacks and use hierarchies to perform data generalization in categorical variables.

Exercise 1: Anonymizing categorical data Exercise 2: Explore the distribution of data Exercise 3: Sampling from the same probability distribution Exercise 4: Anonymizing continuous data Exercise 5: Different distributions Exercise 6: Sampling from the best continuous distribution Exercise 7: Introduction to K-anonymity Exercise 8: Privacy attributes Exercise 9: Generalizing into ranges Exercise 10: Generalizing data using hierarchies Exercise 11: Using hierarchies for categorical data Exercise 12: K-anonymizing a dataset

Learn about differential privacy, the model used by major technology companies such as Apple, Google, and Uber. In this chapter, you’ll explore data by generating private histograms and computing private averages in data. You’ll also create differentially private machine learning models that allow businesses to increase the utility of their data.

Exercise 1: Introduction to differential privacy Exercise 2: Epsilon (ϵ): the magic number Exercise 3: Differentially private histograms Exercise 4: Privacy budgets Exercise 5: Using privacy budgets Exercise 6: When the budget runs out Exercise 7: Exploring data with a privacy budget accountant Exercise 8: Differentially private machine learning models Exercise 9: Build a differentially private classifier (current exercise) Exercise 10: Predicting salaries Exercise 11: Differentially private clustering models Exercise 12: Preprocessing the data Exercise 13: Segmenting customers

In this final chapter, you’ll learn how to apply dimensionality reduction methods such as principal component analysis (PCA) to anonymize large multi-column datasets. You’ll then use Faker to generate realistic and consistent datasets, and scikit-learn to create synthetic datasets that follow a normal distribution. Lastly, you’ll tie everything you learned in this course together as you combine multiple techniques to safely release datasets to the public.

Exercise 1: PCA for anonymization Exercise 2: Anonymization of high-dimensional data Exercise 3: Data masking with PCA Exercise 4: Generating realistic datasets with Faker Exercise 5: Consistent synthetic dataset Exercise 6: Datasets with the same probabilistic distribution Exercise 7: Creating synthetic datasets using scikit-learn Exercise 8: Generating datasets for classification Exercise 9: Generating datasets for clustering Exercise 10: Safely release datasets to the public Exercise 11: Exploring and pseudonymizing a dataset Exercise 12: Preparing employee data for safe release Exercise 13: Great work!