Working with nested columns

1. Working with nested columns

Now the Chicago tourism team needs further assistance with nested data, but this time the nesting is by columns rather than arrays.

2. Nested column data

Each event in the dataset has a venue_context column that captures multiple aspects of the venue in a single cell. The team wants to understand how to work with this data.

3. Venue context data

In the events DataFrame, the venue_context column has a struct dtype. The struct[2] dtype in the header tells us it contains exactly two named fields. Each value in venue_context shows as a pair of fields between curly braces.

4. Inspecting the nested schema

The schema shows Struct({'venue_type': String, 'venue_space': String}). These are the two named fields nested inside the venue context column.

5. Venue context fields

We can also check the names of the fields within the struct by using the struct.fields attribute on the column's Series. This confirms the field names are venue_type and venue_space. The team recalls that each element of a List column is really a Polars Series, and is curious what a Struct element is under the hood.

6. Venue context values

If we print out an element from the venue_context, it looks like a Python dict with the field names as keys.

7. Venue context values

But this isn't really what Polars stores internally. In practice, each field in a struct is a Polars Series. We see this by passing a field name to the struct.field expression. So each field is like a normal DataFrame column that Polars can run fast operations on, even though it takes a bit more work to access. Now the team wants to build a pipeline that filters by the venue_type field to find all galleries for an upcoming art festival.

8. Renaming venue context fields

First, the team wants shorter field names so the report displays more concisely. We show them how to do this with the struct.rename_fields expression

9. Renaming venue context fields

where we pass the list of new field names. In order to apply our gallery filter and prepare the report, we want the struct fields as ordinary top-level columns.

10. Unnesting the venue context

After renaming the fields

11. Unnesting the venue context

we promote them to full columns by passing venue_context to the unnest method on a DataFrame.

12. Unnesting the venue context

The nested venue data is now a standard tabular shape, so we can apply our Gallery filter.

13. Unnesting the venue context

And then we inspect the first five events. The output confirms that from studio visits to photo exhibitions, these are all events that could be wrapped into the art festival.

14. Creating an object dtype

Sometimes you need to store an arbitrary Python value that doesn't fit a native Polars dtype. In this case, the team wants to do some trend analysis where they find the most common terms in the event titles column. At present, the team does this by defining a function that splits the title text and returns a Python set of unique words from the title.

15. Creating an object dtype

They then pass their function to map_elements.

16. Creating an object dtype

The returned column, title_word_set, has the Polars Object dtype.

17. Creating an object dtype

In the printed results, the dtype for the title_word_set column shows as object, and the values are real Python sets, not Polars-managed arrays. The Object dtype trades performance for arbitrary Python flexibility. Use it only when no native nested type fits, as in this case, where a struct dtype is not an option because of the variable number of elements.

18. Let's practice!

Now it's your turn to practice with nested columns.
