
Statistical aggregate functions

1. Statistical aggregate functions

In the prior video, we learned about a number of aggregate functions. Now we're going to cover a couple more, as well as our first of several window functions.

2. Statistical aggregate functions

SQL Server has five aggregate functions which relate to statistics. The AVG() function gives us the mean of the expression we pass in. SQL Server has two standard deviation functions: STDEV(), the sample standard deviation, and STDEVP(), the population standard deviation. Generally, you'll want the sample standard deviation unless you know you're looking at the entire population. The same goes for variance, where we have VAR() for sample variance and VARP() for population variance.
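As a sketch, here is what all five functions might look like in a single query; the table and column names (dbo.Sales, Amount) are made up for illustration:

```sql
-- Hypothetical table dbo.Sales and column Amount, just for illustration.
SELECT
    AVG(s.Amount)    AS MeanAmount,          -- mean
    STDEV(s.Amount)  AS SampleStdDev,        -- sample standard deviation
    STDEVP(s.Amount) AS PopulationStdDev,    -- population standard deviation
    VAR(s.Amount)    AS SampleVariance,      -- sample variance
    VARP(s.Amount)   AS PopulationVariance   -- population variance
FROM dbo.Sales s;
```

Because these are true aggregate functions, the query collapses the whole table into a single row.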

3. What about median?

You may have noticed in the prior list that we have mean but no median. SQL Server does not have a built-in median function. What we do have is the PERCENTILE_CONT() function. PERCENTILE_CONT() takes a parameter, which is the percentile you'd like; here, we want the 50th percentile, or median. Then comes the WITHIN GROUP clause, which specifies how to order the data set--here, by a column named SomeVal. Finally, we have the OVER() clause, which allows us to partition the data and get a median per partition--that is, it helps us define our window. Notice that we have a TOP(1) clause as well. Because PERCENTILE_CONT() is a window function, it returns one row for every row sent in. This means it is not an aggregate function like the others, so we need TOP(1), DISTINCT, or something similar to get back just one row.
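Putting those pieces together, a median query along the lines described might look like this; the table name dbo.Measurements is hypothetical, while SomeVal is the column from the slide:

```sql
-- Hypothetical table dbo.Measurements; SomeVal is the column from the slide.
SELECT TOP(1)
    PERCENTILE_CONT(0.5)                    -- the percentile we want: 0.5 = median
        WITHIN GROUP (ORDER BY m.SomeVal)   -- how to order the data set
        OVER () AS MedianSomeVal            -- empty OVER(): one window over all rows
FROM dbo.Measurements m;
```

Without the TOP(1), this query would return the same median value once per input row, because the window function preserves the row count instead of aggregating it away.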

4. But how bad is it?

This is an execution plan for the AVG() function against a table with 12 million rows. SQL Server scans the 12 million rows, aggregates the data internally, and eventually gives us the one row we expect.

5. This bad

This is the execution plan for calculating the median. If it's too small for your screen, don't worry--there won't be a quiz. What happens is that the 12 million rows get scanned, stored in temporary storage called a lazy spool, run through the Segment - Sequence Project - Compute Scalar combination that accompanies window functions, joined back to the lazy spool, aggregated, joined to the lazy spool yet again, and finally we get back the one row we need. If this seems overly complicated, you're absolutely right.

6. The cost of median

This chart sums up calculating the median versus the mean on a 12-million-row table. The median took 68.5 seconds on my computer and had to run single-threaded. By contrast, SQL Server was able to run the AVG() function across all 24 of my cores, taking 8 seconds of CPU time and turning it into a third of a second of real time. There's also a huge difference in reads: 72 million versus 39 thousand. The 72 million reads come from reading the lazy spool and especially all of those joins. In short, this is not something you want running against large tables on a busy production server.

7. Let's practice!

Now that we know about the statistical aggregate functions and the PERCENTILE_CONT() window function, let's try them out. Don't worry--the data set will be small enough that we can use PERCENTILE_CONT() without everything falling apart.