1. Grouped aggregations
Now we'll learn
how to aggregate data in groups using Polars.
2. Our vacation rentals dataset
We are asked to produce some statistics on the vacation rentals dataset for an annual report.
3. Counting values in a column
We first want to know how many of each
type of property we have.
We start by selecting the property type column, as a Series.
4. Counting values in a column
We then call the .value_counts() method to count occurrences in the Series.
The output is a new DataFrame with two columns: the unique values from the type column and their counts. By default, the output order changes every time.
5. Counting values in a column
However, in our report we want the properties in order from most common to least common.
We add the sort=True argument to .value_counts to apply this order.
6. Visualizing aggregated data
We can make these counts into a bar chart for our report with Plotly.
Here we import plotly express as px, pass the DataFrame as the first argument to px.bar(),
followed by the column names for the x and y-axes.
7. Grouping and aggregating
Now we calculate statistics for our report
where we group by the property type.
8. Grouping and aggregating
We want to get the average price by property type. Inside the .agg method we specify which aggregations to perform. We get the average price for each property type with pl.col("price").mean().
This returns a DataFrame with one column for the property type and another column with the average price for that property type. The output order varies each run, like with .value_counts.
9. Controlling group-by order
To control the .groupby output order,
we can either pass the maintain_order=True argument
so groups appear in their original input order. Or we can add
an explicit sort after the aggregation so we sort by the price column,
arranging results from cheapest to most expensive property type.
10. Multiple aggregations
We need multiple property type statistics for our report that we can calculate together.
Here, we calculate the average and maximum price per property type, using name.suffix to create descriptive column names.
We also calculate the average review score
and use .alias() to name the output column.
11. Grouping by multiple columns
We need to provide more detailed insights for the appendix of our report.
For this we need to group by multiple columns.
The price column in the output shows average prices by property type AND number of bedrooms, while the pl.len expression counts the number of rows in each group.
12. Query optimizations
We can reduce the time needed to run our analysis with query optimizations, as seen in this lazy query where we group by multiple columns.
13. Query optimizations
Using explain() shows the optimized query plan.
The output shows PROJECT 3/8 COLUMNS.
In other words, Polars sees that we only need 3 out of the 8 columns (type, bedrooms and price) for this groupby-aggregation.
14. Let's practice!
Now it's your turn to practice grouping and aggregation!