1. Working with multiple columns
Next, we'll see how Polars
makes it easier to prepare data by working with multiple columns at once.
2. Using pl.col()
When designing property brochures,
our marketing team needs to know the length of both property names and types to ensure they fit in their templates.
3. Using pl.col()
We pass both name and type to pl.col and then continue with the .str.len_chars expression to count the characters in each value.
The name and type columns now show the number of characters each value had in the original DataFrame.
4. Using pl.col() with dtypes
If we want to work with all columns with the same dtype,
we can pass a Polars dtype to pl.col. Here we pass pl.String to pl.col and continue as before with .str.len_chars.
This gives the same output as before.
5. Using pl.col() with dtypes
The other dtypes that we could use to select columns in this DataFrame
are pl.Int64,
pl.Float64,
and pl.Boolean.
6. Introducing selectors
Polars also has a set of functions called selectors
for creating expressions from multiple columns. Selectors allow you to select all columns with similar dtypes.
Here, we use pl.selectors.string to select all of the string columns.
7. Name matching with selectors
Now we need to report on the details of each property. We notice that all of the bedroom columns end with s.
Selectors also have functions to get all columns that have a similar name pattern.
Here we use selectors.ends_with to get all columns that end with s.
8. Combining selectors
However, we also need the string name and type columns for our report.
We can combine selectors for more control over which columns are included. Here we select the string columns together with the columns that end with s using the pipe operator
to get the full set of columns we need for our report.
9. Selectors overview
There are different selectors
for different dtypes, and selectors for different
column name patterns, such as how a name starts or ends. There is an excellent guide to selectors in the Polars docs at the link shown.
10. Adding a suffix to a column name
In our report we also need to display the maximum and minimum price and review score for our portfolio.
It is straightforward to do this for a single aggregation on a column, as we can keep the original column name.
But if we need multiple aggregations of the same column, the outputs need to have distinct names.
11. Adding a suffix to a column name
We can ensure our aggregations have distinct names
by ending each expression with name.suffix. Here we end the .min expression with the suffix underscore min and the .max expression with the suffix underscore max to ensure that
we have distinct column names for our report.
12. Excluding a column
Sometimes it's simpler to specify which columns to exclude from an operation.
To format our report correctly, we need all non-boolean columns to be strings. We use
pl.exclude("beach") to select every column except beach, the boolean column.
13. Let's practice!
Now it's time to practice with multiple columns.