Running total

1. Running total

In this video, we will learn how to calculate multivariate running totals using window functions SQL.

2. What is a running total?

But, what is a running total? A running total is the total of a sequence of values, or the cumulative sum of values. When a new value is added, we add to the total of all previous values. Here is a quick example. We start with a value of one. The sum of itself is one. We go to the next value in the sequence, two. The running total is three because we took the sum the two previous values, one and two. Let's go for one more. The new value is three, we some up all previous values so the running total is now six. In SQL, we can calculate the running total using the SUM function with the OVER clause to make a window function.

3. Calculate the running total

This table contains multiple time series. Each time series is identified by an id. Each time step of the series has a timestamp and a views value. This query adds a field called sum that has calculated the running total of the views field. Note that the running total starts over when the id changes. This is because the window function is partitioned on id. The order in which the running sum is calculated depends on the ORDER BY clause inside the OVER clause. In real life, however, the data may not be provided in an ideal way so we'll need to make a few additional calculations first.

4. The dataset: a train schedule

Let's see how this works with real train data. Here is a table we've seen before, containing a train schedule for a train line. Each row corresponds to a stop on that line. Each row provides the arrival time for the corresponding stop. We'll start by fetching the time of the first stop.

5. Lead function

To do this, we need to use the LEAD function. This function allows us to look at values ahead of our current row. It takes two arguments, the first is the value of our current position and the second is the offset or how many rows we want to look ahead.

6. Using the lead function

Let's apply this to our train data. We will use LEAD on the arrival times, and we specify we want to look one row ahead to get the information for the next stop. Notice the use of OVER meaning LEAD is now a window function. We partition by train id, which corresponds to the train line. Partitioning in this way ensures that the next arrival time is pulled only from the same train line and not a different line.

7. Using the lead function

The result tells us that each train line has seven stops. It also gives a new field of next_arrival_time, the value of which is empty in row seven because there is no next arrival_time for that train.

8. Calculating duration of each stop

Now we need to calculate the time between each stop, and having the new field makes this more straightforward. We can use the LEAD function again, exactly the same way only this time we subtract the arrival time to get the difference, or duration.

9. Calculating duration of each stop

The result gives the duration from the current stop until the next stop. We can now use the SUM OVER window function on the duration field to calculate the running total for a specific train line!

10. Lag function

Before we move on to exercises, let's quickly review one more function. The LAG function. It works the same way as the LEAD function, but it looks back to rows behind instead of ahead. Here is an example query using the LAG function to grab the previous monthly temperature. The result gives the average temperature for the previous row, if it exists within the partition.

11. Let's practice!

Let's practice.

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.