Get startedGet started for free

Bucketing

1. Bucketing

Histograms are great at visualizing the distribution of values in a dataset. So far, though, you can't customize the histograms -- if you're plotting a histogram of orders by user, you can't separate users into separate groups and check the frequencies there.

2. Histograms - recap

Recall that a histogram visualizes a frequency table. A frequency table returns the frequency of each value in a dataset. This query returns the frequency table of orders per user.

3. CASE expression

To customize your histogram and create groups, you'll need the CASE statement. The CASE statement is how SQL implements conditional logic. This query separates meals into three price categories, or "buckets": Low, medium, and high-priced meals. If a meal's price is less than $4, it's a low-priced meal; if it's less than $6, it's a medium-priced meal, otherwise, it's a high-priced meal. For each conditional statement, start the condition with WHEN, and start the resulting value of that condition with THEN. If there's a default value, precede it with ELSE. Grouping by the price category column and counting the meal IDs returns a table of each category and how many meals are in it. Using CASE like this is an effective way to split your data into categories. Let's apply this to the frequency table query.

4. Bucketing - query

Recall the frequency table of user revenues query from the previous lesson. The CTE remains the same; it stores the revenues that Delivr generates from each user. However, as opposed to rounding the revenue column and grouping by it, the CASE statement splits users into three categories: low, medium, and high revenue users. Grouping by the category column and counting the user IDs in each category returns the total number of users in each category.

5. Bucketing - result

This is the result of the query. As opposed to having all values and their frequencies in the table, you have the categories or buckets you defined the CASE statement. The result is cleaner and easier to interpret. Make sure that the categories don't overlap by setting good conditions.

6. Histograms versus bar graphs

As opposed to a histogram, you can visualize the result using a bar graph. Each bar represents a category, and its height is the number of users in that category. The advantage of bar graphs over histograms is that the data is summarized. Histograms show the full spectrum of values found in the data, whereas bar graphs show a summarized version. This is useful especially when presenting to management, who often want a quick and easy way to understand the data's distribution.

7. Bucketing

Bucketing allows you to summarize the frequency tables and present a dataset's distribution in a cleaner way. In the following exercises, you'll write queries that split your users into categories.