Aggregating time series data

1. Aggregating time series data

We will be applying aggregations to time series data in this video.

2. Measuring the length of time series

Here we have a table called dc_news_fact that contains time series data for news articles. Each row gives the number of views within a 20 minute period. The dc_news_dim table provides the title of each news article. This query demonstrates how to measure the length of each time series in the table, by counting the number of rows it has.

3. Measuring the length of time series

In this case, each time series has exactly 144 data points, corresponding to 48 hours of data with three data points per hour.

4. Counting number of non-null entries in a time series

Here we count the number of data points for which the number of views is not null. This is because the COUNT(views) operation counts the number of rows where the views field is not null. It turns out that every one of these has 143 non-null samples. This is because of the way that this data was sampled - the first row always has a null value for the views field because no one has read the article yet.

5. Counting number of non-zero entries in a time series

This query demonstrates how to count the number of non-zero entries in each time series. It adds a WHERE clause to filter the views that are greater than zero.

6. Calculating min and max over time series data

Here we aggregate a field in a time series using MIN and MAX. If instead of the number of entries in a series, we wanted the minimum and maximum number of views per sampling period, we could use this query. It gives the min and max number of views per sampling period per news article over its lifespan. Each time series has at least one sampling period where the article had zero views. The maximum number of views per 20 minute sample ranges from 53 to 4161.

7. Summing time series data

Suppose we were interested in the total number of views across the entire time series. Here's how to do that.

8. Adjusting time granularity

Now, what if we want the total number of views per day for each time series? In the example data we've been using, the time series data is at a different granularity. We can adjust the granularity of the data from 20 minutes to a day, so that we may evaluate each time series over time frames of one day per time frame. One way to achieve this is by truncating the timestamp with DATE_TRUNC. Note how this erases the time information, replacing it with zeros.

9. Adjusting time granularity

Another way to achieve the previous result is to cast the timestamp as a date. Note that we used the cast operator here instead of the CAST function. They both achieve the same result.

10. Measuring the days

Let's look at one more example. We know that these time series are sampled over 48 hours. We can answer the question of how many days each one spans by using COUNT DISTINCT. It turns out that each of the time series overlaps with three different days.

11. Let's practice!

Let's practice what we just learned.

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.