Using Categorical and Enum dtypes

1. Using Categorical and Enum dtypes

The Chicago tourism team finds that their pipelines are memory-intensive. We identify one cause: their events table has numerous string columns where the labels repeat on many rows. Today we help them encode these columns more compactly.

2. Encoding repeated strings

The dataset has an area column holding the name of the neighborhood where each event takes place. Because there is a finite number of neighborhoods, these values repeat throughout the dataset, such as Loop in the first and fourth rows here.

3. Encoding repeated strings

We can reduce memory use by encoding these strings as integers, with Polars storing the mapping from string to integer. This works because each distinct string is stored only once in the mapping, while each row holds a small integer that takes less memory than another copy of the string.
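The idea can be sketched in plain Python, without Polars. This is only an illustration of the encoding concept, not how Polars implements it internally, and the labels are made-up stand-ins for the area column.

```python
# Toy labels standing in for the area column (values are illustrative).
areas = ["Loop", "West Loop", "Lincoln Park", "Loop", "West Loop", "Loop"]

# Build the mapping from each distinct string to an integer code,
# numbering labels in order of first appearance.
mapping = {}
for label in areas:
    if label not in mapping:
        mapping[label] = len(mapping)

# Each row now stores only a small integer instead of the full string.
codes = [mapping[label] for label in areas]

# Decoding reverses the mapping, so no information is lost.
categories = list(mapping)
decoded = [categories[code] for code in codes]

print(mapping)  # {'Loop': 0, 'West Loop': 1, 'Lincoln Park': 2}
print(codes)    # [0, 1, 2, 0, 1, 0]
```

The more often a label repeats, the more the one-time cost of storing the mapping is outweighed by the savings on every row.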

4. A dataset with repeated labels

Once again, we use the events DataFrame.

5. A dataset with repeated labels

A value_counts on the area column confirms that each label repeats thousands of times in this dataset. Loop alone has more than 30,000 rows. When string columns have this much repetition, a Categorical representation can save memory.

6. Creating a Categorical column

To create a Categorical column, we cast the area column to the Categorical dtype. We assign the result to a new DataFrame called events_cat.

7. Creating a Categorical column

In the output, the area column looks like a normal string column apart from the cat dtype.

8. Creating a Categorical column

If we want to see the underlying integer encodings, we can use to_physical. This can be handy when dealing with strings that look the same when printed but are actually different values, for example labels that differ only by trailing whitespace, because they receive different integer codes. As the area column looks like a string column, the team wants to know if they can use normal string expressions.

9. Using a categorical expression

We show them that there are a couple of categorical expressions for string analysis. Here, for example, the team wants to flag every event in an area starting with "West" for a west-side promotion. We call .cat.starts_with directly on the encoded area column and create a "westside" alias.

10. Using a categorical expression

We then filter by the westside column and inspect the first rows. This gives us promising events like Greektown Market and the West Loop Chef Showcase. But we caution the team that the cat namespace covers only a few string operations, such as starts_with and ends_with. For other string expressions, they need to cast the column back to the String dtype.

11. Categorical and Enum

Categorical works well when labels repeat, even if new values may still appear later. Polars has an alternative dtype for encoding repeated strings called Enum. With an Enum dtype, we specify the allowed categories up front, which gives both compact storage and built-in validation.

12. Creating an Enum column

An Enum is a good option for the tourism team as they have a fixed list of approved Chicago neighborhoods. The first step is to define an Enum dtype object that specifies every allowed area.

13. Creating an Enum column

We then cast the area column to Enum with area_enum, storing the result in a new DataFrame called events_enum.

14. Creating an Enum column

The values look unchanged except that the dtype is now enum. If someone tried to add new data where the name of the neighborhood was entered incorrectly, then this cast to Enum would raise an error. That validation is what Enum adds on top of Categorical.

15. Inspecting the Enum dtype

If the team needs to check how the enum is defined, the dtype itself carries the full list of valid categories. The team should reach for the Enum dtype when their vocabulary is fixed and Categorical when it can still grow.

16. Enum memory efficiency

We show the team a further difference between Categorical and Enum dtypes by calling to_physical on the Enum column. The dtype of the physical column is u8, an unsigned 8-bit integer dtype. Polars can use it here because the small, fixed set of allowed neighborhoods fits within the range of an 8-bit integer. So this Enum column uses less memory per row than the Categorical column, which stores 32-bit integers.

17. Let's practice!

We helped the team cast repeated string labels to Categorical and Enum and used .cat expressions on the encoded column. Now it's your turn to practice with compact encodings.