Safely release datasets to the public

1. Safely release datasets to the public

Hello! In this video, you will learn how to combine the techniques covered throughout the course to prepare datasets for release.

2. Exploring datasets

When analyzing the possible privacy concerns of a dataset we want to share, it's best to first acquire domain knowledge about it. Here we have a large dataset named Health Insurance Cross Selling. The id attribute seems to be PII. There are also quasi-identifiers such as age, gender, and a unique code for the region of the customer.

3. Exploring datasets

With the info() method from pandas, we can obtain a concise summary of a DataFrame. It prints information about the DataFrame, including the column names, the data types, and the number of non-null values in each column. Here we see that most of the columns don't have any null values.
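As a minimal sketch of this step, the snippet below calls info() on a small made-up DataFrame. The values, and every column name other than id, are assumptions for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy stand-in for the insurance dataset (all values are made up).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Male", None],
    "Age": [44, 76, 47, 21],
    "Region_Code": [28.0, 3.0, 28.0, 11.0],
})

# Prints the column names, dtypes, and non-null counts.
df.info()

# The same null information as a Series, convenient for checks in code.
null_counts = df.isna().sum()
print(null_counts)
```

Counting nulls with isna().sum() is an addition here; it gives the same information as the non-null column of info(), but in a form that can be used programmatically.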

4. Exploring datasets

We can use the nunique() method to find the number of unique values in each column of a DataFrame. From the output, we confirm that id is unique per row, since it has 381109 unique values, the same as the number of rows in the DataFrame. We will consider a column categorical if it has fewer than 20 unique values. There are several categorical variables, among them gender, Driving_License, whose values are 0 or 1, and Vehicle_Age, which is represented with string ranges.
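The check can be sketched as follows on a made-up 25-row DataFrame. The data and the exact column spellings are assumptions; only the threshold of 20 comes from the lesson.

```python
import pandas as pd

# Hypothetical toy data: 25 rows so that the id column exceeds the threshold.
n_rows = 25
df = pd.DataFrame({
    "id": range(1, n_rows + 1),
    "Gender": ["Male", "Female"] * 12 + ["Male"],
    "Driving_License": [1] * 24 + [0],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"] * 8 + ["1-2 Year"],
})

# Number of distinct values per column.
unique_counts = df.nunique()
print(unique_counts)

# A column with as many unique values as rows identifies each row.
id_is_unique = bool(unique_counts["id"] == len(df))

# Treat columns with fewer than 20 unique values as categorical.
categorical_columns = unique_counts[unique_counts < 20].index.tolist()
print(id_is_unique, categorical_columns)
```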

5. Suppressing unique attributes

For the id column, a practical approach is to suppress it. This is because it doesn't provide relevant information for prediction tasks, and is instead used to uniquely identify a customer. We can suppress it with the drop method from pandas, passing the name of the column as the first argument and columns as the axis.
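A minimal sketch of the suppression, assuming a toy DataFrame with made-up values. The name suppressed_df is hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical toy data (values are made up).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "Gender": ["Male", "Female", "Male"],
    "Age": [44, 76, 47],
})

# Drop the identifier column; drop() returns a new DataFrame by default.
suppressed_df = df.drop("id", axis="columns")
print(suppressed_df.columns.tolist())
```

Because drop returns a copy unless told otherwise, the original df still contains id; the result has to be assigned to a variable for the suppression to take effect.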

6. Cleaning data

With the dropna method from pandas, we remove rows that contain missing values, such as NaN (not a number) values, by setting the axis to index.
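The cleaning step might look like the sketch below. The input data is made up, and suppressed_df is a hypothetical name for the DataFrame left after dropping the identifier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two incomplete rows.
suppressed_df = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "Age": [44, 76, np.nan, 21],
})

# axis="index" drops every row that has at least one missing value.
cleaned_df = suppressed_df.dropna(axis="index")
print(cleaned_df)
```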

7. Sampling from categorical values

As seen in chapter 2, we can sample from the original probability distribution. Here we compute the distribution using the value_counts() method, setting normalize to True to obtain the relative frequency of each gender in the dataset. We obtain a probability of about 54 percent for a male to appear in the dataset and about 46 percent for a female.
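As an illustration, the following sketch computes the distribution on a made-up column with a 60/40 split. The proportions are chosen for readability and are not those of the real dataset.

```python
import pandas as pd

# Hypothetical toy data: 6 males and 4 females.
cleaned_df = pd.DataFrame({"Gender": ["Male"] * 6 + ["Female"] * 4})

# normalize=True returns proportions that sum to 1 instead of raw counts.
distributions = cleaned_df["Gender"].value_counts(normalize=True)
print(distributions)
```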

8. Sampling from categorical values

With the distributions computed, we use the choice function from the random module of numpy to generate new gender values and replace the original ones. We pass the category names, in this case female and male, then the corresponding probabilities, and set the size of the sample to the number of rows in cleaned_df.
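A minimal sketch of the replacement, assuming toy data. The seed is an addition so the example is repeatable, and the Age column is made up.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: seeded only to make the example repeatable

# Hypothetical toy data.
cleaned_df = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 4,
    "Age": list(range(20, 30)),
})

distributions = cleaned_df["Gender"].value_counts(normalize=True)

# Draw one new value per row, using the observed proportions as probabilities.
cleaned_df["Gender"] = np.random.choice(
    distributions.index.to_numpy(),  # category names
    p=distributions.to_numpy(),      # matching probabilities
    size=len(cleaned_df),            # one draw per row
)
print(cleaned_df["Gender"].value_counts())
```

Each row now holds a gender drawn independently of the customer's true value, so the column keeps its overall proportions but no longer links a gender to a specific record.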

9. Sampling from categorical values

Since the dataset is considerably large, with 381109 rows, the sampled values will follow the original distribution very closely.

10. Sampling from categorical values

We can check this by computing the distributions of the resulting sampled values. We obtain 54.19 percent for male, compared with 54.0 percent in the original dataset. The same holds for female: the difference is very small.
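The comparison can be sketched on synthetic data as follows. The 54/46 split, the row count of 100000, and the seed are assumptions chosen to mimic the lesson, not the real dataset.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # assumption: seeded only to make the example repeatable

# Hypothetical column with an exact 54/46 split over 100000 rows.
original = pd.Series(["Male"] * 54_000 + ["Female"] * 46_000)
distributions = original.value_counts(normalize=True)

# Resample the whole column from the observed distribution.
sampled = pd.Series(np.random.choice(
    distributions.index.to_numpy(),
    p=distributions.to_numpy(),
    size=len(original),
))
sampled_distributions = sampled.value_counts(normalize=True)

# Put both distributions side by side and measure the largest gap.
comparison = pd.DataFrame({
    "original": distributions,
    "sampled": sampled_distributions,
})
max_abs_diff = (comparison["original"] - comparison["sampled"]).abs().max()
print(comparison)
print(max_abs_diff)
```

With this many rows the gap is expected to be a fraction of a percentage point; with only a few dozen rows it could be much larger.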

11. Removing column names

Finally, we can replace the column names with a sequence of numbers. Use range to generate the numbers from 0 up to, but not including, the total number of columns in the dataset. The resulting dataset can be considered private enough to be released for global data analysis purposes, for example, to calculate the mean or variance.
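A minimal sketch of this last step on made-up data. The column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data.
cleaned_df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [44, 76, 47],
    "Driving_License": [1, 1, 0],
})

# Replace the names with 0, 1, ..., number of columns minus 1.
cleaned_df.columns = range(len(cleaned_df.columns))
print(cleaned_df.columns.tolist())

# Global statistics can still be computed on the anonymous columns.
mean_age = cleaned_df[1].mean()
var_age = cleaned_df[1].var()
print(mean_age, var_age)
```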

12. Let's practice!

Let's practice!
