Get startedGet started for free

Percentiles

1. Percentiles

The last method covered in this chapter to understand a dataset's distribution is the percentile. Percentiles help you understand what percentage of your dataset is beneath a certain value, and what percentage is at or above it.

2. Percentiles - overview

The nth percentile is the value for which n% of your data is beneath this value. The lowest or minimum value in your dataset is the 0th percentile -- 0% of your data's values are beneath this value. The highest or maximum value is the 99th percentile -- 99% of your data's values are below this value.

3. Quartiles

The 0th and 100th percentiles aren't very useful, though. Assume that you're exploring the distribution of orders per user. If the 25th percentile of that dataset is 17, 25% of users have ordered less than 17 times. Conversely, 75% of users have ordered more at least 17 times. The 25th and 75th percentiles are called the first and third quartiles respectively. A good way to remove outliers from your dataset is to remove everything outside of the interquartile range, or IQR. The IQR is defined as all data between the first and third quartiles. The 50th percentile or the second quartile is the median, or middle value in the dataset. Don't mistake the median for the mean (or average)! These are two different values. Let's see how.

4. Skewed data

If your dataset's median is less than the average (or mean), your data is skewed positively, meaning that some high values are pushing the average up, whereas most values are beneath that average. If your dataset's median is more than the average, your data is skewed negatively, meaning that some low values are pushing the average down, whereas most values are above that average.

5. Quartiles - query

To calculate the quartiles of the orders frequency table, you first need to store each user's count of orders in a CTE. The PERCENTILE_CONT window function takes a decimal value between 0 and 1 and returns the associated percentile value. Here, the query returns the first (0.25), second (0.5), and third (0.75) quartiles, as well as the mean (or average). Remember to order your results in ascending order in the window, because percentiles are based on the order of the data.

6. Quartiles - result

This is the result. 25% of users ordered less than 6 times; 50% of users ordered less than 8 times, and so on. Since the mean is slightly higher than the median (the 50th percentile, or the second quartile), the data skews slightly positively.

7. Percentiles

Percentiles and quartiles are quick ways to determine values by which you can separate your dataset into groups. They're also used to determine the skew of your data. Practice writing queries to calculate percentiles in the following exercises.