Get startedGet started for free

Advanced aggregations

1. Advanced aggregations

In this lesson, you will use the colon equal to operator for more advanced aggregations.

2. Recap

In the previous videos, you learned how to use the colon equal to operator in two ways: a) LHS := RHS form and b) the functional form, to add and update columns in a data table by reference. In addition, you also saw how to combine the ":=" operator with the "by" argument to add a single column while grouping. Now let's see how you can add multiple columns by reference while using the "by" argument.

3. Adding multiple columns by reference by group

You can provide all the columns to be added/updated along with their values in either of the two forms: LHS:=RHS or functional form, and add the "by" argument to group by those columns. As you can see here, for each "end_station", the first and last duration values are assigned to the two new columns. Note that you would have identical first and last values of duration for all rows corresponding to the same end station.

4. Binning values

We will now create a new column based on the median duration for each unique combination of start_station and end_station as shown here. As you might have guessed, you can use if-else statements to accomplish this.

5. Multi-line expressions in j

The "j" argument can also handle complex multi-line expressions. To do this, you can use the left-hand-side := right-hand-side form and wrap the expressions inside curly brackets in the right-hand-side. As you can see here, we first calculate the median duration for each group, med_dur. Then, we use a series of if-else statements to create the new column trip_category by reference.

6. Alternative way

Alternatively, you can create a user-defined function to accomplish the same task. The bin_median_duration() function does exactly the same job as the multi-line expression on the previous slide.

7. All together - i, j and by

So far you have used the "j" and "by" arguments to add/update columns by reference. Now let's combine all three - "i", "j", and "by" together. In the example shown here, data table first evaluates the expression in "i" which returns all the rows where duration is greater than 500. Then on those rows, the "by" argument is applied by creating groups for each unique combination of "start_station" and "end_station". Finally, for each of those groups the expression in "j" is evaluated. We thereby obtain the smallest duration greater than 500 for each specified group. If all values for a particular group are less than 500, there are no rows to group by, so the result would be NA, as shown in the first two rows of output here.

8. Let's practice!

Now it's your turn to practice advanced aggregations.