Grouped aggregations

1. Grouped aggregations

Now we'll learn how to aggregate data in groups using Polars.

2. Our vacation rentals dataset

We are asked to produce some statistics on the vacation rentals dataset for an annual report.

3. Counting values in a column

We first want to know how many of each type of property we have. We start by selecting the property type column, as a Series.

4. Counting values in a column

We then call the .value_counts() method to count occurrences in the Series. The output is a new DataFrame with two columns: the unique values from the type column and their counts. By default, the output order changes every time.

5. Counting values in a column

However, in our report we want the properties in order from most common to least common. We add the sort=True argument to .value_counts to apply this order.

6. Visualizing aggregated data

We can make these counts into a bar chart for our report with Plotly. Here we import plotly express as px, pass the DataFrame as the first argument to px.bar(), followed by the column names for the x and y-axes.

7. Grouping and aggregating

Now we calculate statistics for our report where we group by the property type.

8. Grouping and aggregating

We want to get the average price by property type. Inside the .agg method we specify which aggregations to perform. We get the average price for each property type with pl.col("price").mean(). This returns a DataFrame with one column for the property type and another column with the average price for that property type. The output order varies each run, like with .value_counts.

9. Controlling group-by order

To control the .groupby output order, we can either pass the maintain_order=True argument so groups appear in their original input order. Or we can add an explicit sort after the aggregation so we sort by the price column, arranging results from cheapest to most expensive property type.

10. Multiple aggregations

We need multiple property type statistics for our report that we can calculate together. Here, we calculate the average and maximum price per property type, using name.suffix to create descriptive column names. We also calculate the average review score and use .alias() to name the output column.

11. Grouping by multiple columns

We need to provide more detailed insights for the appendix of our report. For this we need to group by multiple columns. The price column in the output shows average prices by property type AND number of bedrooms, while the pl.len expression counts the number of rows in each group.

12. Query optimizations

We can reduce the time needed to run our analysis with query optimizations, as seen in this lazy query where we group by multiple columns.

13. Query optimizations

Using explain() shows the optimized query plan. The output shows PROJECT 3/8 COLUMNS. In other words, Polars sees that we only need 3 out of the 8 columns (type, bedrooms and price) for this groupby-aggregation.

14. Let's practice!

Now it's your turn to practice grouping and aggregation!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.