Consistent synthetic dataset
One scenario in which companies use synthetic data is the training of artificial intelligence and machine learning models. Real-world data is sometimes expensive to collect, or simply hard to come by. When the training data is highly imbalanced (e.g., more than 90% of instances belong to one class), generating synthetic examples of the underrepresented class can help build more accurate machine learning models.
In this exercise, you will generate a mobile app rating dataset using Faker.
The initial DataFrame is loaded as ratings with two columns: rating and gender. A Faker() generator has already been initialized for you as fake_data.
This exercise is part of the course Data Privacy and Anonymization in Python.
Have a go at this exercise by completing this sample code.
# Generate a name according to the gender that will be unique in the dataset
ratings['name'] = [____ if x == "Female"
                   else ____
                   for x in ratings['gender']]