Datasets with the same probabilistic distribution
The goal of synthetic data is to create a dataset that is as realistic as possible, and does so without endangering important pieces of personal information. For instance, a team at Deloitte Consulting generated 80% of the training data for a machine learning model by synthesizing data. The resulting model accuracy was similar to a model trained on real data.
In this exercise, you will generate a synthetic dataset from scratch using Faker
that follows a probabilistic distribution loaded as p
.
The Faker
generator fake_data
has been already initialized and numpy
is imported as np
.
Cet exercice fait partie du cours
Data Privacy and Anonymization in Python
Exercice interactif pratique
Essayez cet exercice en complétant cet exemple de code.
# Obtain or specify the probabilities
p = (0.46, 0.26, 0.16, 0.1, 0.02)
# Generate 5 random cities
cities = ____
# See the generated cities
print(cities)