
Sampling from the same probability distribution

Many organizations, such as the U.S. Census Bureau, publicly release samples of the data they collect about private citizens. These datasets are first anonymized using various techniques, and then only a small fraction, 1% to 5% of the data, is released for analysis. A random sample approximately preserves the data's statistical characteristics, so people can still study and understand the underlying population.
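To illustrate the claim that a small random sample keeps the overall proportions, here is a minimal sketch on synthetic data. The `population` DataFrame, its `group` column, and the category probabilities are all made up for this illustration; they are not part of the exercise or of any real release.

```python
import numpy as np
import pandas as pd

# Synthetic population (hypothetical data, for illustration only)
rng = np.random.default_rng(42)
population = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], p=[0.6, 0.3, 0.1], size=100_000)
})

# Release a 5% random sample of the rows
sample = population.sample(frac=0.05, random_state=42)

# Compare the relative frequencies in the population and in the sample
pop_freq = population["group"].value_counts(normalize=True)
sample_freq = sample["group"].value_counts(normalize=True)

# Largest absolute difference between the two sets of proportions
max_gap = (pop_freq - sample_freq).abs().max()
print(pop_freq.round(3))
print(sample_freq.round(3))
print("Largest gap:", round(max_gap, 4))
```

The gap shrinks as the sample grows, which is why even a small sample can support population-level statistics.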

In this exercise, you will anonymize the department column of the IBM HR dataset by sampling new values from the column's original distribution.

The dataset has been loaded as hr.

This exercise is part of the course

Data Privacy and Anonymization in Python


Exercise instructions

  • Obtain the relative frequencies of each unique value in the department column.
  • Extract the probabilities from counts and store them in a variable called distributions.
  • Sample from the previously calculated probability distributions. The size of the sample should be the same as the size of the hr dataset.
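The three steps above can be sketched on a toy DataFrame. This is one possible approach under stated assumptions, not necessarily the course's reference solution: `toy` is a hypothetical stand-in for hr, and `value_counts(normalize=True)` is assumed to be an acceptable way to obtain the relative frequencies.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the hr dataset (the real one is preloaded in the exercise)
toy = pd.DataFrame({
    "department": ["Research"] * 6 + ["Sales"] * 3 + ["HR"] * 1
})

# Step 1: relative frequency of each unique value (proportions that sum to 1)
counts = toy["department"].value_counts(normalize=True)

# Step 2: the probabilities as an array, in the same order as counts.index
distributions = counts.values

# Step 3: draw one value per row from the estimated distribution
np.random.seed(0)  # fixed seed only so the sketch is repeatable
toy["department"] = np.random.choice(counts.index,
                                     p=distributions,
                                     size=len(toy))

print(toy.head())
```

Because every row receives a freshly drawn value, the link between an individual row and its original department is broken, while the column's overall proportions stay approximately the same for a large enough dataset.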

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Obtain the probability distribution counts 
counts = ____

# Get the probability distribution values 
distributions = ____

# Sample from the calculated probability distributions
hr['department'] = np.random.choice(counts.index, 
                                    p=____, 
                                    size=len(____))

# See the resulting DataFrame
print(hr.head())