Exercise

Explore the distribution of data

To anonymize a dataset by sampling realistic replacement values, we need some domain and statistical knowledge of the data. As we have seen, the key step is finding the probability distribution of the column of interest.
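As a rough illustration of why the distribution matters, here is a minimal sketch that resamples a categorical column according to its observed relative frequencies. The `toy` DataFrame and its values are made up for illustration and stand in for the real data; the sampling approach shown is one possible option, not necessarily the one used later in the course.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data (illustrative values only, not the real dataset)
toy = pd.DataFrame({
    "business_travel": ["Travel_Rarely"] * 7
    + ["Travel_Frequently"] * 2
    + ["Non-Travel"] * 1
})

# Relative frequency of each category (sums to 1)
probs = toy["business_travel"].value_counts(normalize=True)

# Draw new values that follow the same distribution as the original column
rng = np.random.default_rng(seed=42)
sampled = rng.choice(probs.index.to_numpy(), size=len(toy), p=probs.to_numpy())

# Store the sampled values next to the originals for comparison
toy["business_travel_sampled"] = sampled
print(toy.head())
```

The sampled column contains only categories that occur in the original, in roughly the same proportions, but individual rows no longer carry their true values.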

In this exercise, you will explore the column business_travel from a simplified version of the IBM HR dataset.

The DataFrame has been imported as hr, and numpy has been imported as np. As mentioned in the previous chapter, pandas is imported as pd for this exercise and the rest of the course.

Instructions 1/3

  1. Print the absolute frequencies of each unique value in the business_travel column.
  2. Print the probability distribution of the business_travel variable (i.e., the relative frequencies of each category).
  3. Generate a bar plot to visualize the absolute frequencies of each category in business_travel using the .value_counts() result.
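
The three steps above can be sketched as follows. This is a minimal example under stated assumptions: `hr_toy` is a small made-up DataFrame standing in for hr, which is only available inside the exercise environment, and the plot uses matplotlib through the pandas plotting interface.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the hr DataFrame (illustrative values only)
hr_toy = pd.DataFrame({
    "business_travel": ["Travel_Rarely"] * 7
    + ["Travel_Frequently"] * 2
    + ["Non-Travel"] * 1
})

# Step 1: absolute frequencies of each unique value
counts = hr_toy["business_travel"].value_counts()
print(counts)

# Step 2: relative frequencies, i.e. the empirical probability distribution
distribution = hr_toy["business_travel"].value_counts(normalize=True)
print(distribution)

# Step 3: bar plot of the absolute frequencies
counts.plot(kind="bar")
plt.show()
```

In the exercise itself, the same calls would be applied to hr["business_travel"] instead of the stand-in column.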