Rolling and cumulative aggregations

1. Rolling and cumulative aggregations

We've learned to aggregate data by time windows. Now let's explore rolling and cumulative statistics for even more powerful analysis.

2. Why rolling statistics?

Electricity prices can be extremely volatile, jumping dramatically from one hour to the next.

3. Why rolling statistics?

We can calculate rolling statistics that help us smooth out this noise so we can identify underlying trends.

4. Rolling mean on a column

Let's start by calculating a rolling mean to smooth the price data.

5. Rolling mean on a column

We use the rolling_mean expression on a column.

6. Rolling mean on a column

We specify the window size - here we use 3, meaning each row is the average of this row and the previous 2 rows.

7. Rolling mean on a column

And use an alias to create a new column called smoothed.

8. Rolling mean on a column

The first two values are null because we don't have enough previous values for a 3-row window.

9. Rolling mean on a column

From the third row, we get the rolling mean of the current and two previous prices. For example, the last row's 30.1 is the mean of 34.2, 28.7, and 27.5.

10. Centering the window

If we want the rolling mean to include the current, previous, and next row, we can center the window.

11. Centering the window

We set center=True. We then get smoothed values from the second row. A centered window is useful when you have a full record - for example, smoothing a complete day's temperatures after the fact. Use the trailing window when you don't know what comes next.

12. Other rolling expressions

Polars offers other rolling expressions, including rolling sum, standard deviation, maximum, and minimum. However, you'll use rolling_mean most often.

13. Rolling on a DataFrame

Rolling expressions are great for individual columns, but if we want to roll many columns, we can use the `rolling()` method on a DataFrame.

14. Rolling on a DataFrame

With rolling(), we specify the time column as the index.

15. Rolling on a DataFrame

Then we set period as the rolling window duration.

16. Rolling on a DataFrame

Then we chain agg to specify aggregations. Here we want to get the 3-hour rolling average for all columns.

17. Rolling on a DataFrame

Each row now shows the average of the preceding 3 hours.

18. Why cumulative statistics?

Now let's turn to cumulative statistics. Instead of rolling windows, cumulative values track running totals from the start of the data. This chart shows instantaneous solar power, and we want to detect whether peak solar output has occurred so far today.

19. Why cumulative statistics?

So we add the cumulative maximum to see when solar power is below the maximum so far.

20. Cumulative statistics

To track the maximum solar power seen so far today,

21. Cumulative statistics

We use the cum_max expression.

22. Cumulative statistics

which we add as a new column.

23. Cumulative statistics

Each row shows the max of all solar values up to and including that row. By 11 AM, we reach the maximum value.

24. Cumulative sum for running totals

Another common use case is tracking running totals. Here we use cum_sum to calculate the total solar energy produced so far. So on the last row, we get 342, which is the sum of the solar values of 62, 100, and 180 so far that day.

25. Let's practice!

Now we can smooth data with rolling statistics and track running totals with cumulative functions. Time to practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.