Handling array data
1. Handling array data
In the last chapter, we got data into Polars from files and databases. Now we're asked to help a different group, the Chicago tourism team. They are building pipelines to understand trends in popular events, but they need help working with nested data and keeping their pipelines memory-friendly.
2. Array data
The tourism team records every event that occurs in the city, along with its characteristics, in the tags column. A single event can have several characteristics: the night market in Greektown, for example, is tagged food, shopping, and nightlife. Because each row holds several values, the tags column is an example of array data.
3. Events dataset
The tourism team stores their data in Parquet, which has built-in support for array data. We load it into the events DataFrame.
4. Events dataset
To introduce us to the data, they select the tags column along with other key information about each event. The dtype of the tags column is list[str], so Polars sees each value as a list of strings. In a list column, the lists can have different lengths on different rows. For example, Folk Festival has three tags, whereas Chef Showcase has two. The team asks whether each element of a list column is a Python list object.
5. Event tags
We select the first element and show them that each element is actually a Polars Series with a string dtype. So working with array data in Polars is fast because Polars is still working with its native dtypes. The team tells us that the first tag on each row is the primary tag, and they want to extract this into a separate column.
6. Getting the primary tag
We show them how to do this by first selecting the event title and the tags column.
7. Getting the primary tag
Then we extract the first element of each list with the list.get expression, passing 0 as the index.
8. Getting the primary tag
We give this a clean column name for downstream use by following up with the alias expression.
9. Getting the primary tag
For the Chef Showcase, the primary tag is food, while for the Folk Festival, it is music. The events team is planning a family-focused campaign and needs to know how many tags each event has and which ones are family-friendly.
10. Parsing event features
We start by selecting the title and tags columns.
11. Parsing event features
Then we use the .list.len expression to count how many tags each event carries, and call the result tag_count.
12. Parsing event features
Then, to find family-friendly events, we pass "family" to the .list.contains expression. For each row, Polars returns True if the list contains family. We call this column has_family.
13. Family-friendly events
Once we filter by the has_family column, the team can see their family-friendly events. For each of these events, the tag_count column shows how multi-faceted it is.
14. Polars list expressions
Polars has many more list expressions, including expressions to index a list, like list.first; to de-duplicate a list, like list.unique; and to do arithmetic on numeric lists, like list.mean. The full set can be found in the documentation. Next, the events team wants to find the most popular tags to plan their content strategy. They ask if there is a list expression for that.
15. Exploding the tags column
We tell them that for a more complicated analysis, we need to reshape the DataFrame so that each tag is on its own row and then use standard DataFrame methods. We start by selecting the title and tags columns.
16. Exploding the tags column
Then we pass tags to the explode method, which unpacks each list into a separate row.
17. Exploded tags
After exploding, the tags column is now a normal string column, while the event_title column lets us track which event each tag came from.
18. Counting tag popularity
Now we can count tag popularity by creating a Series from the tags column and doing a normal value_counts operation.
19. Counting tag popularity
Sorting the output puts the most popular tags first: family dominates while rarer tags trail off.
20. Let's practice!
Now we've helped the tourism team with their work, it's your turn to practice working with list data.