1. Bucketing
Histograms are great at visualizing the distribution of values in a dataset. So far, though, you can't customize the histograms -- if you're plotting a histogram of orders by user, you can't separate users into separate groups and check the frequencies there.
2. Histograms - recap
Recall that a histogram visualizes a frequency table. A frequency table returns the frequency of each value in a dataset. This query returns the frequency table of orders per user.
3. CASE expression
To customize your histogram and create groups, you'll need the CASE statement. The CASE statement is how SQL implements conditional logic. This query separates meals into three price categories, or "buckets": Low, medium, and high-priced meals. If a meal's price is less than $4, it's a low-priced meal; if it's less than $6, it's a medium-priced meal, otherwise, it's a high-priced meal.
For each conditional statement, start the condition with WHEN, and start the resulting value of that condition with THEN. If there's a default value, precede it with ELSE.
Grouping by the price category column and counting the meal IDs returns a table of each category and how many meals are in it. Using CASE like this is an effective way to split your data into categories. Let's apply this to the frequency table query.
4. Bucketing - query
Recall the frequency table of user revenues query from the previous lesson. The CTE remains the same; it stores the revenues that Delivr generates from each user.
However, as opposed to rounding the revenue column and grouping by it, the CASE statement splits users into three categories: low, medium, and high revenue users. Grouping by the category column and counting the user IDs in each category returns the total number of users in each category.
5. Bucketing - result
This is the result of the query. As opposed to having all values and their frequencies in the table, you have the categories or buckets you defined the CASE statement. The result is cleaner and easier to interpret.
Make sure that the categories don't overlap by setting good conditions.
6. Histograms versus bar graphs
As opposed to a histogram, you can visualize the result using a bar graph. Each bar represents a category, and its height is the number of users in that category.
The advantage of bar graphs over histograms is that the data is summarized. Histograms show the full spectrum of values found in the data, whereas bar graphs show a summarized version. This is useful especially when presenting to management, who often want a quick and easy way to understand the data's distribution.
7. Bucketing
Bucketing allows you to summarize the frequency tables and present a dataset's distribution in a cleaner way. In the following exercises, you'll write queries that split your users into categories.