NTILE and CUME_DIST

1. NTILE and CUME_DIST

Let's take what we've done a level deeper, this time with `NTILE` and `CUME_DIST`.

2. Creating buckets of rows

Let's say the marketing team wants to help advertise the right classes to each person. To do that, we'd want to create groups of gym members based on their workouts, with an output that looks like this.

3. NTILE

To "bucket" data in this fashion, we can use `NTILE`. `NTILE` creates "N" equally-sized buckets of records. The syntax is just like any other window function. We pass a number `n` to `NTILE`. Then, we choose the field we'd like to use to create the buckets. If we'd like to evenly-distribute records in each bucket, we can do that with a `PARTITION BY`. We'll see that in just a bit.

4. Bucketing fitness data

To create 50 "buckets" of gym members, we'd build a query like this. After the `member_id`, `gym_location`, and `calories_burned` fields, we pass 50 to the `NTILE` function to create 50 equally-sized buckets of members, based on calories burned. Adding an `ORDER BY` the field you've just created, and `calories_burned` at the end of the query helps to keep the output organized.

5. Bucketing fitness data

We've answered our original question! Using `NTILE`, we've successfully created groups of gym members based on their workouts. Although each bucket is "equally-sized", their gym locations aren't evenly distributed. We can fix that!

6. Evenly-distributed buckets of fitness data

There's only one change needed evenly distribute rows by gym location; adding a `PARTITION BY gym_location`. Other than that, everything else will remain the same.

7. Evenly-distributed buckets of fitness data

After adding the `PARTITION BY`, the records in each bucket are evenly-distributed by location. Now, each bucket has one record for Miami, Portland, and Cleveland.

8. Understanding a distribution

Let's changes gears a bit. What if we want to better understand the distribution of an entire field? For example, what is the distribution of calories burned for each member's workout? Where does a specific workout fall in this distribution? What proportion of members burned the same of fewer calories than a specific member?

9. CUME_DIST

The query here creates a distribution of calories burned. We do this with the `CUME_DIST` function. The values stored in `cd` capture the proportion of records with values less than that value. Partitioning by the `gym_location` field means the distribution will be specific to each location. Like with `NTILE`, we order the final results by `gym_location` and the field we've just created to keep the output organized.

10. CUME_DIST

`CUME_DIST` allows us to compare each record to the distribution of a certain field, known as a cumulative distribution. The field passed to `PARTITION BY` determines the window to create a distribution for. The field passed to `ORDER BY` will determine the distribution. This allows us to determine which proportion of records are less than or equal to an individual record.

11. CUME_DIST

The results are quite powerful! We can now compare each record to it's distribution for that location. We can interpret the output like this; member 541 in the Miami gym burned more calories than 13.1% of that gym's membership.

12. Let's practice!

Alright! It's your turn to use `NTILE` and `CUME_DIST` in Snowflake.

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.