
Handling array data

1. Handling array data

In the last chapter, we got data into Polars from files and databases. Now we're asked to help a different group, the Chicago tourism team. They are building pipelines to understand trends in popular events, but they need help working with nested data and keeping their pipelines memory-friendly.

2. Array data

The tourism team records every event that occurs in the city along with its characteristics in the tags column. However, events can have multiple characteristics, such as the night market in Greektown, which they tag under food, shopping, and nightlife. The tags column is an example of array data.

3. Events dataset

The tourism team stores their data in Parquet, which has built-in support for array data. We load it into the events DataFrame.

4. Events dataset

To introduce us to the data, they select the tags column along with other key information about each event. The dtype of the tags column is list[str], so Polars sees each value as a list of strings. In a list column, the lists can have different lengths on different rows. For example, Folk Festival has three tags, whereas Chef Showcase has two. The team asks whether each element of a list column is a Python list object.

5. Event tags

We select the first element of the tags column and show them that each element is actually a Polars Series with a string dtype, not a Python list. Working with array data in Polars is therefore fast, because Polars is still working with its native dtypes. The team tells us that the first tag on each row is the primary tag, and they want to extract this into a separate column.

6. Getting the primary tag

We show them how to do this by first selecting the event title and the tags column.

7. Getting the primary tag

Then we extract the first element of each list by using the list.get expression with 0 as the argument to get the first value.

8. Getting the primary tag

We give this a clean column name for downstream use by following up with the alias expression.

9. Getting the primary tag

For the Chef Showcase, the primary tag is food, while for the Folk Festival, it is music. The events team is planning a family-focused campaign and needs to know how many tags each event has and which ones are family-friendly.

10. Parsing event features

We start by selecting the title and tags columns.

11. Parsing event features

Then we use the list.len expression to count how many tags each event carries and call this tag_count.

12. Parsing event features

Then, to find family-friendly events, we use the list.contains expression with "family" as the argument. Polars returns True for every row whose list contains family. We call this column has_family.

13. Family-friendly events

Once we filter by the has_family column, the team can see their family-friendly events. For each of these events, the tag_count column shows how multi-faceted it is.

14. Polars list expressions

Polars has many more list expressions, including expressions to index a list, like list.first; de-duplicate a list, like list.unique; and do arithmetic on numeric lists, like list.mean. The full set can be found in the documentation. Next, the events team wants to find the most popular tags to plan their content strategy. They ask if there is a list expression for that.

15. Exploding the tags column

We tell them that for a more complicated analysis, we need to reshape the DataFrame so that each tag is on its own row and then use standard DataFrame methods. We start by selecting the title and tags columns.

16. Exploding the tags column

Then we pass tags to the explode method, which unpacks each list into a separate row.

17. Exploded tags

After exploding, the tags column is now a normal string column, while the event_title column lets us track which event each tag came from.

18. Counting tag popularity

Now we can count tag popularity by taking the tags column as a Series and running a normal value_counts operation on it.

19. Counting tag popularity

Sorting the output puts the most popular tags first: family dominates while rarer tags trail off.

20. Let's practice!

Now that we've helped the tourism team with their work, it's your turn to practice working with list data.
