Downsampling and upsampling data

1. Downsampling and upsampling data

Throughout this chapter so far, we have looked at various aggregation functions. Now I want to pivot slightly and think about periodicity, or the grain of our time data.

2. Data in nature

Suppose we have a table with a date column. When we query that table, columns stored as the DATETIME or DATETIME2 data type come back as a combination of date and time. This is useful information for point analysis (who did what when), but we will often want to change this grain.
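
As a quick illustration, here is a minimal query against a hypothetical dbo.Visits table with a DATETIME2 column named VisitDate; the table and column names are assumptions for this sketch.

    -- Each row comes back at the full date-and-time grain.
    SELECT
        v.CustomerID,
        v.VisitDate  -- e.g. 2019-03-14 08:37:21.0000000
    FROM dbo.Visits v;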

3. Downsampling data

Downsampling is another term for changing to a coarser grain. For example, we can cast a DATETIME column to the DATE type. This gives us a coarser grain: daily data rather than a combination of date and time.
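
Here is a sketch of that daily rollup against the same hypothetical dbo.Visits table: cast the DATETIME column to DATE and group on the result.

    -- Roll visits up to the daily grain by casting away the time component.
    SELECT
        CAST(v.VisitDate AS DATE) AS VisitDay,
        COUNT(*) AS NumberOfVisits
    FROM dbo.Visits v
    GROUP BY CAST(v.VisitDate AS DATE);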

4. Further downsampling

This is not the only form of downsampling, however; we can also downsample to other grains. For example, suppose we want to roll up to the hour instead of the day. We can use a combination of the DATEADD() and DATEDIFF() functions to do this. To understand this expression, let's start from the inside and work our way out. We figure out the number of hours from SQL Server's starting point (that is, time 0) until our customer visits start. The DATEDIFF() function lops off any unused date or time parts, so anything finer than the hour goes away and DATEDIFF() returns an integer representing the number of hours from time 0 until SomeDate. Then, we add that number of hours back to time 0. The end result is a DATETIME value truncated to the hour.
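
Here is a sketch of that hourly rollup, again using the hypothetical dbo.Visits table and VisitDate column.

    -- DATEDIFF() counts whole hours from time 0 to VisitDate; DATEADD() adds that count
    -- back to time 0, leaving a DATETIME truncated to the hour.
    SELECT
        DATEADD(HOUR, DATEDIFF(HOUR, 0, v.VisitDate), 0) AS VisitHour,
        COUNT(*) AS NumberOfVisits
    FROM dbo.Visits v
    GROUP BY DATEADD(HOUR, DATEDIFF(HOUR, 0, v.VisitDate), 0);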

5. What about upsampling?

To this point, I've focused on downsampling because it is by far the more useful of the two. Upsampling is the opposite of downsampling: taking aggregated data and attempting to disaggregate it. Downsampling typically works by grouping to a coarser grain and summing or counting the results, so it is pretty easy to do. Upsampling, meanwhile, requires an allocation rule: if you have data at an hourly grain, how will you allocate it to each of the 60 minutes? This leads to the biggest problem of upsampling: it provides an artificial granularity. With our hourly data, by allocating records to particular minutes, we have to pretend that we know more than we really do. Still, there are uses for upsampling data. For example, with data at the daily grain, we might be able to use a Poisson or a hypergeometric distribution to model occurrences and generate test data at the lower grain. We can also use upsampled data to get an idea of what a normal minute or hour might look like given daily data, although finding any answer better than a uniform distribution will require some knowledge of the underlying distribution of events. Downsampling is much more common and much more acceptable in the business world than upsampling: for every time I have needed to upsample data, I have downsampled, without exaggeration, hundreds of times.
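
To make the allocation-rule idea concrete, here is a sketch of a naive uniform allocation. It assumes a hypothetical pre-aggregated table dbo.DailyVisits(VisitDay, NumberOfVisits) and spreads each daily total evenly across the 24 hours of that day.

    -- Uniform allocation: every hour gets 1/24 of the daily total, which is exactly
    -- the artificial granularity described above.
    SELECT
        DATEADD(HOUR, h.HourOfDay, CAST(d.VisitDay AS DATETIME)) AS VisitHour,
        d.NumberOfVisits / 24.0 AS EstimatedVisits
    FROM dbo.DailyVisits d
        CROSS JOIN (
            SELECT TOP (24) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS HourOfDay
            FROM sys.all_objects
        ) h;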

6. Let's practice!

Now that we've covered some of the theory and technique behind downsampling and upsampling, let's try out a few exercises.