Datasets with the same probabilistic distribution
The goal of synthetic data is to create a dataset that is as realistic as possible, and does so without endangering important pieces of personal information. For instance, a team at Deloitte Consulting generated 80% of the training data for a machine learning model by synthesizing data. The resulting model accuracy was similar to a model trained on real data.
In this exercise, you will generate a synthetic dataset from scratch using Faker that follows a probabilistic distribution loaded as p.
The Faker generator fake_data has been already initialized and numpy is imported as np.
Latihan ini adalah bagian dari kursus
Data Privacy and Anonymization in Python
Latihan interaktif praktis
Cobalah latihan ini dengan menyelesaikan kode contoh berikut.
# Obtain or specify the probabilities
p = (0.46, 0.26, 0.16, 0.1, 0.02)
# Generate 5 random cities
cities = ____
# See the generated cities
print(cities)