1. Anonymizing continuous data
Hi! In this video you will learn how to anonymize continuous data by sampling from the best-fitting distribution.
2. Continuous variables
Continuous variables are variables that can take an infinite number of values between any two values. A continuous variable is numeric or a date and time. Some examples are age, height, weight, temperature, and the date and time a payment is received.
3. Continuous variables
Here we have the IBM dataset. From these selected columns, age is considered to be continuous, since it doesn't fall into a limited number of categories.
4. Continuous distributions
To sample data in the most realistic way possible, we need to select a continuous distribution that is similar to the original column data.
For that, we will:
Create a histogram using a pre-defined number of bins.
Next, we try different continuous distributions, such as the normal distribution or the exponential distribution, and fit each of them to the histogram. This fitting process also yields the parameters for each continuous distribution function.
To approximate the continuous variable, we will keep the function with the smallest error (the smallest residual sum of squares) between itself and the histogram, and finally sample from it.
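The steps above can be sketched roughly as follows. This is only a minimal illustration, not the exact code used in the course: the ages array is synthetic stand-in data rather than the IBM dataset, and the three candidate distributions and the bin count of 30 are arbitrary choices made for the example.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for an age column (assumption: not the IBM data).
rng = np.random.default_rng(42)
ages = rng.normal(loc=37, scale=9, size=1000)

# Step 1: histogram with a pre-defined number of bins, normalized to a density.
density, edges = np.histogram(ages, bins=30, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Step 2: fit each candidate distribution and measure the residual sum of
# squares between its PDF (evaluated at the bin centers) and the histogram.
candidates = {
    "norm": stats.norm,
    "expon": stats.expon,
    "genlogistic": stats.genlogistic,
}
errors = {}
for name, dist in candidates.items():
    params = dist.fit(ages)            # shape (if any), location, scale
    pdf = dist.pdf(centers, *params)
    errors[name] = float(np.sum((density - pdf) ** 2))

# Step 3: keep the candidate with the smallest error.
best_name = min(errors, key=errors.get)
print(best_name, errors)
```

Which distribution wins depends on the data, so the result for a real column may differ from what this synthetic sample gives.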
5. Continuous distribution
With SciPy, it's possible to try different distribution models. After trying multiple distributions, the best fit for the column Age is "genlogistic" from SciPy, a generalized logistic continuous random variable. Here we see the histogram of the variable with the fitted distribution and, in the title, its parameters.
6. Applying a distribution
With the distribution chosen, we can sample the data using SciPy. For that, import stats from scipy.
First we fit the distribution to obtain the parameters of the model.
The fit method returns the maximum likelihood estimates for the shape, location, and scale parameters from the data.
This fit is computed by maximizing a log-likelihood function, with a penalty applied for samples outside the range of the distribution.
With these parameters we will be able to replicate the distribution of our data.
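As a small sketch of this fitting step, assuming a synthetic age-like sample in place of the real column (the variable names here are illustrative only):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the Age column (assumption: not the IBM data).
rng = np.random.default_rng(0)
ages = rng.normal(loc=37, scale=9, size=1000)

# fit returns a tuple: the shape parameter(s) first, then location and scale.
# genlogistic has one shape parameter, so the tuple has three entries.
params = stats.genlogistic.fit(ages)
shape, loc, scale = params
print(params)
```

The order of the returned tuple matters later, because the same tuple is passed back to the distribution when sampling.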
7. Sampling from the continuous distribution
To sample, we will use the rvs method of the chosen distribution, which comes from the stats module of scipy. This method generates random variates of a given type.
We pass the size of the desired sample to the parameter size. In this case, it will be the size of the whole dataset. We also pass the values in params calculated during fitting, unpacked into positional arguments with an asterisk before the variable name.
We see the resulting sample of the age values following the distribution specified before. Nevertheless, they aren't integers.
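A minimal sketch of the sampling step is shown here, again with a synthetic DataFrame standing in for the real dataset. The column name Age matches the transcript; the data, the seed, and the random_state argument (used only to make the sketch reproducible) are assumptions for the example.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the dataset (assumption: not the IBM data).
rng = np.random.default_rng(1)
df = pd.DataFrame({"Age": rng.normal(loc=37, scale=9, size=500)})

# Fit the chosen distribution to the original column.
params = stats.genlogistic.fit(df["Age"])

# Replace the column with a fresh sample of the same length.
# *params unpacks the fitted shape, location, and scale as positional arguments.
df["Age"] = stats.genlogistic.rvs(*params, size=df.shape[0], random_state=rng)
print(df["Age"].head())
```

The sampled values are floats, so they follow the fitted distribution but no longer correspond to any individual's real age.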
8. Sampling from the continuous distribution
If we would like to make the data discrete, we can round the values in the column using the round method from pandas.
Here we see that we obtain a dataset with the age values rounded to their closest integer.
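For illustration, rounding could look like the sketch below. The four sample values are made up, and the final conversion to an integer dtype is an optional extra step that the transcript does not mention.

```python
import pandas as pd

# Made-up sampled ages, standing in for the output of the sampling step.
df = pd.DataFrame({"Age": [36.72, 41.18, 29.49, 52.96]})

# round() keeps the float dtype; astype(int) is an optional extra step
# that stores the rounded values as integers.
df["Age"] = df["Age"].round().astype(int)
print(df["Age"].tolist())  # [37, 41, 29, 53]
```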
9. Let's practice!
Now it's your turn to work with some datasets!