Datasets with the same probability distribution

The goal of synthetic data is to create a dataset that is as realistic as possible without putting important personal data at risk. For example, a team at Deloitte Consulting synthesized 80% of the training data for a machine learning model. The accuracy of the resulting model was comparable to that of a model trained on real data.

In this exercise, you will use Faker to generate a synthetic dataset from scratch that follows a given probability distribution, which is loaded as p.

The Faker generator fake_data has already been initialized, and numpy has been imported as np.
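To make concrete what it means for a dataset to follow p, here is a minimal sketch that uses only numpy. The category labels are placeholders (in the exercise they would be city names produced by fake_data), and the sample size and seed are arbitrary choices made for illustration.

```python
import numpy as np

# Probabilities from the exercise; they must sum to 1
p = (0.46, 0.26, 0.16, 0.1, 0.02)

# Placeholder labels standing in for Faker-generated city names (assumption)
labels = ["city_a", "city_b", "city_c", "city_d", "city_e"]

# Seeded generator so the run is reproducible (the seed is arbitrary)
rng = np.random.default_rng(42)

# Draw 10,000 labels; each draw picks labels[i] with probability p[i]
sample = rng.choice(labels, size=10_000, p=p)

# Compare the empirical frequencies with the target probabilities
values, counts = np.unique(sample, return_counts=True)
freqs = dict(zip(values.tolist(), (counts / counts.sum()).tolist()))
for label, target in zip(labels, p):
    print(f"{label}: target={target:.2f}, observed={freqs.get(label, 0.0):.3f}")
```

With a large sample the observed shares should land close to p; with only a handful of draws they can differ noticeably.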

This exercise is part of the course

Data Privacy and Anonymization in Python

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Obtain or specify the probabilities
p = (0.46, 0.26, 0.16, 0.1, 0.02)

# Generate 5 random cities 
cities = ____

# See the generated cities
print(cities)
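One possible way to fill in the blank, offered as a sketch rather than the official solution: call the generator's city() method once per probability in p. Here fake_data is created explicitly because this snippet runs outside the exercise environment. The fallback city names, the seed, and the final sampling step with a size of 1000 are assumptions added for illustration.

```python
import numpy as np

# Probabilities from the exercise
p = (0.46, 0.26, 0.16, 0.1, 0.02)

try:
    from faker import Faker

    # In the exercise this object is already provided as fake_data
    fake_data = Faker()
    # One fake city per probability in p
    cities = [fake_data.city() for _ in range(len(p))]
except ImportError:
    # Fallback so the sketch still runs without Faker; names are made up
    cities = ["Northville", "Eastport", "Westbury", "Southfield", "Lakemont"]

# See the generated cities
print(cities)

# Hypothetical next step: build a column of 1000 cities distributed as p
rng = np.random.default_rng(0)
synthetic_cities = rng.choice(cities, size=1000, p=p)
print(synthetic_cities[:10])
```

The first city in the list is drawn about 46% of the time and the last about 2% of the time, so the resulting column mirrors the target distribution without containing any real records.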

Get ready to apply anonymization techniques such as data suppression, masking, synthetic data generation, and generalization. In this chapter, you’ll learn how to distinguish between sensitive and non-sensitive personally identifiable information (PII), quasi-identifiers, and the basics of the GDPR. You'll also encounter real-life examples of what can go wrong if you don't follow these best practices.

Exercise 1: What's private, and why do we care?
Exercise 2: Privacy is power
Exercise 3: Is it sensitive or non-sensitive?
Exercise 4: Suppression of sensitive attributes
Exercise 5: Data masking and data generation with Faker
Exercise 6: Masking sensitive PII
Exercise 7: Removing names with Faker
Exercise 8: Anonymizing with data generalization
Exercise 9: Reducing identification risk with generalization
Exercise 10: Data aggregation and data generalization
Exercise 11: Top and bottom coding White House salaries

Discover how to anonymize data by sampling from datasets following the probability distribution of the columns. You’ll then learn how to apply the k-anonymity privacy model to prevent linkage or re-identification attacks and use hierarchies to perform data generalization in categorical variables.

Exercise 1: Anonymizing categorical data
Exercise 2: Explore the distribution of data
Exercise 3: Sampling from the same probability distribution
Exercise 4: Anonymizing continuous data
Exercise 5: Different distributions
Exercise 6: Sampling from the best continuous distribution
Exercise 7: Introduction to K-anonymity
Exercise 8: Privacy attributes
Exercise 9: Generalizing into ranges
Exercise 10: Generalizing data using hierarchies
Exercise 11: Using hierarchies for categorical data
Exercise 12: K-anonymizing a dataset

Learn about differential privacy, the model used by major technology companies such as Apple, Google, and Uber. In this chapter, you’ll explore data by generating private histograms and computing private averages in data. You’ll also create differentially private machine learning models that allow businesses to increase the utility of their data.

Exercise 1: Introduction to differential privacy
Exercise 2: Epsilon (ϵ): the magic number
Exercise 3: Histograms with differential privacy
Exercise 4: Privacy budgets
Exercise 5: Using privacy budgets
Exercise 6: When no budget is left
Exercise 7: Exploring data with a privacy budget accountant
Exercise 8: Differentially private machine learning models
Exercise 9: Build a differentially private classifier
Exercise 10: Predicting salaries
Exercise 11: Differentially private clustering models
Exercise 12: Pre-processing data
Exercise 13: Segmenting customers

In this final chapter, you’ll learn how to apply dimensionality reduction methods such as principal component analysis (PCA) to anonymize large multi-column datasets. You’ll then use Faker to generate realistic and consistent datasets, and scikit-learn to create synthetic datasets that follow a normal distribution. Lastly, you’ll tie everything you learned in this course together as you combine multiple techniques to safely release datasets to the public.

Exercise 1: PCA for anonymization
Exercise 2: Anonymizing high-dimensional data
Exercise 3: Data masking with PCA
Exercise 4: Generating realistic datasets with Faker
Exercise 5: Consistent synthetic dataset
Exercise 6: Datasets with the same probability distribution (current exercise)
Exercise 7: Creating synthetic datasets with scikit-learn
Exercise 8: Generating datasets for classification
Exercise 9: Generating datasets for clustering
Exercise 10: Safely releasing datasets to the public
Exercise 11: Exploring and pseudonymizing a dataset
Exercise 12: Preparing employee data for safe release
Exercise 13: Well done!