Datasets with the same probabilistic distribution
The goal of synthetic data is to create a dataset that is as realistic as possible without exposing sensitive personal information. For instance, a team at Deloitte Consulting generated 80% of the training data for a machine learning model by synthesizing data. The resulting model's accuracy was similar to that of a model trained on real data.
In this exercise, you will generate a synthetic dataset from scratch using Faker that follows a probabilistic distribution loaded as p.
The Faker generator fake_data has already been initialized, and numpy is imported as np.
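To make "follows a probabilistic distribution" concrete, here is a minimal sketch rather than the course's own code. It assumes hypothetical placeholder labels A to E standing in for generated values, and a seeded NumPy generator so the run is reproducible. It draws a large sample and compares the observed frequencies with p.

```python
import numpy as np

# Probabilities from the exercise; they must sum to 1.
p = (0.46, 0.26, 0.16, 0.1, 0.02)

# Hypothetical placeholder labels standing in for generated values.
categories = ["A", "B", "C", "D", "E"]

# Seeded generator so the sketch is reproducible (an assumption, not part of the exercise).
rng = np.random.default_rng(42)

# Draw 10,000 values; each label is picked with its matching probability in p.
sample = rng.choice(categories, size=10_000, p=p)

# Compare observed frequencies with the target distribution.
labels, counts = np.unique(sample, return_counts=True)
observed = dict(zip(labels.tolist(), (counts / sample.size).round(3).tolist()))
print(observed)
```

With a sample this large, the observed shares should sit close to the values in p. A smaller sample would show more deviation.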
This exercise is part of the course Data Privacy and Anonymization in Python.
Have a go at this exercise by completing this sample code.
# Obtain or specify the probabilities
p = (0.46, 0.26, 0.16, 0.1, 0.02)
# Generate 5 random cities
cities = ____
# See the generated cities
print(cities)