Reducing memory pressure

1. Reducing memory pressure

In the last video, we used Categorical and Enum to shrink repeated strings. Now we help the tourism team go further by measuring overall memory use and downcasting numeric columns.

2. Events dataset

We start with the same events DataFrame.

3. Events dataset

But we also need to incorporate the remaining numeric columns. The visitors column stores the estimated number of attendees. The profile column stores a rating from 1 to 5, where 5 marks a well-known international event and 1 marks a mostly local one. The price column stores the average ticket price, with 0 for free events.

4. Estimating memory use

We call estimated_size on the events DataFrame to get its estimated memory footprint. But the answer comes back in bytes, which the team finds hard to read.

5. Estimating memory use

We show them that passing "mb" as the unit gives the answer in more familiar units, with an estimated size of 171 MB. For larger DataFrames, the team could also pass "gb" to get the size in gigabytes or even "tb" to get the size in terabytes! Now the team wants to reduce the memory pressure on their pipelines.

6. Encoding repeated strings

First, we suggest that the team cast the columns with repeated string values to Categorical. They start by casting the repeated area values.

7. Encoding repeated strings

They can also cast the tags column, which contains many repeated values, from a list of strings to a list of categoricals.

8. Encoding repeated strings

They can even cast the string fields nested inside the venue_context struct column to Categorical. To do this, we call struct.with_fields. This is the struct equivalent of with_columns.

9. Encoding repeated strings

Inside with_fields, we call pl.field with the list of field names and cast them to Categorical. So pl.field is the equivalent of pl.col for a struct field.

10. Encoding repeated strings

Together, this gives us the updated DataFrame with categorical columns. The area and tags columns are clearly shown as categorical, but the change is not visible in the output for the venue_context struct column.

11. Encoding repeated strings

We started with a DataFrame of 171 MB, but with all repeated strings cast to Categorical, the estimated size is down to 146 MB, a reduction of about 15%. Now we will see if we can reduce memory pressure from the numeric columns.

12. Numeric event data

The events DataFrame has two integer columns, visitors and profile, and one float column, the average price. By default, Polars stores integer and float columns with a 64-bit dtype.

13. Integer range

We can check the largest and smallest values a numeric column can store with the upper_bound and lower_bound expressions. While the largest events, like the Chicago Air and Water Show, draw a crowd of a million people, the 64-bit integer dtype allows values of up to 9 quintillion people. That seems like more than we need for the visitors column! So how many bits do we need?

14. Integer range

We check by casting visitors to narrower integer dtypes and calling upper_bound again. A 32-bit integer column can hold values up to about 2 billion, a 16-bit integer column can hold values up to about 33 thousand, and an 8-bit integer column can hold values up to 127.

15. Float precision

Float columns like price can sometimes be downcast to 32 bits. For float columns, the main question is usually precision rather than range: will the lower-precision dtype preserve calculations closely enough for our analysis? For a column like price, where we only need approximate summary statistics such as means, Float32 is often acceptable.

16. Downcasting numeric columns

So we agree that the visitors and profile columns can be downcast to 32-bit and 8-bit integers, respectively, while the price column can be downcast to 32-bit floats. Memory use for a numeric column scales with its bit width, so every reduction in bits gives a proportional decrease in memory pressure. The 8-bit profile column therefore uses one-eighth of the memory of the 64-bit column.

17. Measuring the memory change

Taking all of our changes together, overall memory use falls from 171 MB to 131 MB, a reduction of around 23%. At this point, the main barriers to further reductions are the event title and date columns, which cannot be reduced in size.

18. Let's practice!

Now that the team has relieved memory pressure in their pipelines, it's your turn to practice.
