1. Generating realistic datasets with Faker
In this video, you will learn how to generate realistic datasets whose values are consistent across rows and columns.
2. Generating data with Faker
As we have seen in the first chapter, we can generate data with the Faker package.
Here we generate a fully random name, which could be male or female.
We can also generate a male or female name using the corresponding functions.
3. Clients DataFrame
This is a dataset that has columns for the gender and active status of clients from a software-as-a-service company.
4. Generating a dataset with Faker
We will generate a dataset using Faker. First, we generate unique names that are consistent with gender.
Then we generate random cities, cities drawn from a specified list according to a probability distribution, emails, and dates.
5. Making names match their gender
Let's start with names.
First, we need to import the Faker class. Then we initialize a faker generator with its constructor.
To generate names based on gender, we can use a list comprehension to specify the condition. A list comprehension offers a shorter syntax when you want to create a new list based on the values of an existing list.
To create unique names, we can call the name functions through the unique attribute of the Faker generator. This attribute ensures that values are not repeated within the same Faker generator instance.
Here we return a female name if the gender is female and otherwise we return a male name, for every row in the gender column.
We assign this resulting list to a new column called name in clients_df.
6. Making names match their gender
The resulting dataset will match the names with the client's gender. The names will be unique in the column.
7. Generating a random city
We can also generate a city for every row in clients_df. We calculate the dataset size with len() and create a range from it, generating one city per element of that range.
8. Generating emails
We can also generate random emails with company-like domains, instead of regular free emails such as Gmail, with the company_email function from Faker.
Here we generate one company email for each row in clients_df.
9. Generating emails
If we want to generate emails that are similar to the fake names, we can do so by iterating through the name column of clients_df. This time we use the replace method to remove the space between the first and last name, and then we concatenate an at symbol. We finally concatenate a generated domain name using the domain_name function from Faker.
10. Generating dates
We can generate random dates between two different times with the function date_between. Imagine that the company has only been active for 10 years. We would then specify the start date to be 10 years ago.
For that, we pass a string value to the start_date parameter with a minus sign in front of 10 and finally the character "y" representing years. And for the end_date parameter, we pass the string "now" as the argument.
11. Generating cities following a probabilistic distribution
When we imitate a real dataset, we can add another degree of privacy by replacing the real values with substitutes, so that the originals are not leaked.
As we have seen in chapter 2, we can sample from specified values following a probability distribution function.
Here we import numpy and set the probabilities as a tuple, where each element is the likelihood of the corresponding city appearing in the dataset. We then specify the four cities where the company is based. We can either create these probabilities ourselves or get them from another dataset.
We then call the choice function from numpy's random module, passing the cities, the desired sample size, and the probabilities as arguments.
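The sampling step might look like the sketch below. The city names, the probabilities, and the row count n_rows are all hypothetical; in practice the size would be the number of rows in clients_df.

```python
import numpy as np

# Hypothetical probabilities, one per city; they must sum to 1
p = (0.4, 0.3, 0.2, 0.1)

# Hypothetical list of the four cities where the company is based
cities = ["San Francisco", "Chicago", "New York", "Washington"]

# Assumed number of rows; with a DataFrame this would be len(clients_df)
n_rows = 1000

# Draw one city per row according to the given probabilities
city_sample = np.random.choice(cities, size=n_rows, p=p)

print(city_sample[:10])
```

The resulting array can then be assigned to a column of the dataset.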
12. Generating cities following a probabilistic distribution
Now the dataset will only have cities from that selection.
13. Let's generate datasets!
Let's generate some datasets!