Categorical data in pandas

1. Categorical data in pandas

The most common way of working with categorical data in Python is with pandas. Let's take a quick look at how pandas handles categorical data.

2. dtypes: object

After reading in the adult census income dataset, we can print out the dtypes of each column by using the dtypes property. We see dtypes of primarily int64, which is one way to save values as integers, and dtypes of object. By default, pandas tries to infer the data type of each column. In this dataset, the numerical values have been assigned a dtype of int64, while the columns containing strings are stored with the object dtype.
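As a rough illustration of the dtypes property, here is a minimal sketch. The small DataFrame below is a hypothetical stand-in for the adult census income dataset, and its column names and values are assumptions made for this example. Depending on your pandas version, the string column may be reported as object or as a dedicated string dtype.

```python
import pandas as pd

# Hypothetical stand-in for the adult census data:
# two numeric columns and one column of strings.
adult = pd.DataFrame({
    "Age": [39, 50, 38],
    "Marital Status": ["Never-married", "Married-civ-spouse", "Divorced"],
    "Hours/Week": [40, 13, 40],
})

# dtypes (plural) reports the inferred data type of every column.
print(adult.dtypes)
```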

3. dtypes: categorical

By default, columns containing strings are not stored using pandas' category dtype, as not every column containing strings needs to be categorical. Let's look at the dtype for Marital Status. We use the dtype property, as opposed to the dtypes property, since we are working with a Series and not a DataFrame. Pandas uses a capital O to represent the object dtype. We can convert this to the categorical dtype using the astype method and specifying category. This time, the output is quite different. Pandas is telling us that the variable is now saved as categorical and is providing the list of the categories found in the Series. Finally, notice in the printout that ordered equals False, indicating there is currently no order for these categories.
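The conversion described here might look like the following sketch. The Series is a hypothetical sample of Marital Status values, and it is created with dtype=object explicitly so that the starting point matches the narration regardless of pandas version.

```python
import pandas as pd

# Hypothetical sample of the Marital Status column, stored as object.
marital = pd.Series(
    ["Never-married", "Married-civ-spouse", "Divorced", "Never-married"],
    dtype=object,
    name="Marital Status",
)

# dtype (singular) because this is a Series, not a DataFrame.
print(marital.dtype)

# Convert to the category dtype.
marital_cat = marital.astype("category")

# The dtype now lists the categories and whether they are ordered.
print(marital_cat.dtype)
```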

4. Creating a categorical Series

There are two ways to create a categorical pandas Series when your data is not already in a DataFrame format. First, we can use pd.Series on a list or array of data and set the dtype argument to category. The printout shows we have created a categorical Series with categories of A, B, and C.
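A minimal sketch of this first approach follows. The list contents and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical raw data that is not yet in a DataFrame.
my_data = ["A", "A", "C", "B", "C", "A"]

# Setting dtype="category" creates a categorical Series directly.
my_series1 = pd.Series(my_data, dtype="category")
print(my_series1)
```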

5. Creating a categorical Series

The second way is to use pd.Categorical. We are showing this alternative because it allows us to tell pandas that the categories have a logical order by setting the ordered argument to True. The order is set with the categories parameter: whichever order you list the categories in will be the order of the categories going forward. Notice that the printout states that the order is C, then B, then A, which matches the order we used when creating the categorical Series.
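The second approach could be sketched as follows, again with made-up data and variable names. Note that pd.Categorical returns a Categorical object rather than a Series; wrapping it in pd.Series, as done here, is one way to get a Series that keeps the same ordering.

```python
import pandas as pd

# Hypothetical raw data, same idea as before.
my_data = ["A", "A", "C", "B", "C", "A"]

# categories fixes the order; ordered=True tells pandas the order is meaningful.
my_cat = pd.Categorical(my_data, categories=["C", "B", "A"], ordered=True)
print(my_cat)

# Wrap the Categorical in a Series if a Series is needed.
my_series2 = pd.Series(my_cat)
print(my_series2.dtype)
```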

6. Why do we use categorical: memory

There are a few reasons why storing a pandas Series with a dtype of categorical is useful. Let's look at the easiest one to quantify: it is a huge memory saver. Take a look at the number of bytes Python uses to store the Marital Status column when saved as an object compared to when it is saved as a categorical column. We can do this using the nbytes attribute. In this example, using a categorical dtype reduced the memory footprint by a factor of eight. Since pandas will by default load all the data into your computer's memory, reducing your memory footprint can be helpful when dealing with large datasets.
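A comparison along these lines can be sketched as below. The data is a hypothetical, repeated list of Marital Status values, so the exact savings will differ from the real dataset. Keep in mind that nbytes on an object Series counts only the references to the strings, not the strings themselves; memory_usage(deep=True) gives a fuller picture.

```python
import pandas as pd

# Hypothetical column: a few distinct values repeated many times.
statuses = ["Never-married", "Married-civ-spouse", "Divorced"] * 1000

as_object = pd.Series(statuses, dtype=object)
as_category = as_object.astype("category")

# nbytes reports the bytes used by the underlying data.
print(as_object.nbytes)
print(as_category.nbytes)

# A deeper measurement that also counts the Python string objects.
print(as_object.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
```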

7. Specify dtypes when reading data

If you know the data types of columns before reading in a dataset, it is good practice to specify at least some of the column dtypes. This can be done by creating a dictionary with column names as keys and data types as values. By setting the dtype parameter equal to this dictionary, pandas will set the dtypes of any columns that match keys found in the dictionary. We can quickly check the Marital Status dtype using the dtype attribute.
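Here is a minimal sketch of passing such a dictionary to pd.read_csv. To keep the example self-contained, it reads from an in-memory string instead of a file; the CSV contents, column names, and variable names are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the real data file.
csv_text = """Age,Marital Status,Hours/Week
39,Never-married,40
50,Married-civ-spouse,13
38,Divorced,40
"""

# Column names as keys, data types as values.
adult_dtypes = {"Marital Status": "category"}

# Columns named in the dictionary get the requested dtype;
# the remaining columns are inferred as usual.
adult = pd.read_csv(io.StringIO(csv_text), dtype=adult_dtypes)

print(adult["Marital Status"].dtype)
```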

8. pandas category practice

Let's practice using the category dtype.
