CommencerCommencer gratuitement

Datasets with the same probabilistic distribution

The goal of synthetic data is to create a dataset that is as realistic as possible, and does so without endangering important pieces of personal information. For instance, a team at Deloitte Consulting generated 80% of the training data for a machine learning model by synthesizing data. The resulting model accuracy was similar to a model trained on real data.

In this exercise, you will generate a synthetic dataset from scratch using Faker that follows a probabilistic distribution loaded as p.

The Faker generator fake_data has been already initialized and numpy is imported as np.

Cet exercice fait partie du cours

Data Privacy and Anonymization in Python

Afficher le cours

Exercice interactif pratique

Essayez cet exercice en complétant cet exemple de code.

# Obtain or specify the probabilities
p = (0.46, 0.26, 0.16, 0.1, 0.02)

# Generate 5 random cities 
cities = ____

# See the generated cities
print(cities)
Modifier et exécuter le code