1. Computations by groups
In this lesson, you will put together all the things you have seen so far to perform computations by group. You will now fully realize the potential of data table's [i, j, by] syntax.
2. The by argument
Most data wrangling tasks require doing the same computation on several groups of data. For example, how would you calculate the total number of trips for each start_station?
This can be done using the "by" argument that data table provides. Specifying by equals start_station groups the data by that column. The expression in "j", here dot-N, is computed for each group. Recall that dot-N is the special symbol that holds the total number of rows. In a grouping operation, it holds the number of rows in each group.
Note the columns in the resulting data table. Firstly, the column corresponding to the total rows computed using the special symbol dot-N is automatically named capital "N", for convenience. Additionally, the column used in "by" is retained in the result. Also note that the column by which you group your data is always returned as the first column.
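To make this concrete, here is a minimal sketch of counting by group. The small batrips table below is a made-up stand-in with hypothetical values, not the real data set used in the lesson.

```r
library(data.table)

# Toy stand-in for batrips (hypothetical values, for illustration only)
batrips <- data.table(
  start_station = c("A", "B", "A", "C", "B", "A"),
  duration      = c(300, 450, 200, 600, 150, 500)
)

# Count the trips for each start station.
# .N holds the number of rows in each group.
ans <- batrips[, .N, by = "start_station"]
print(ans)
```

The grouping column, start_station, comes first in the result, followed by the count column named "N". Groups appear in the order in which they first occur in the data.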
3. The by argument
In the previous example, we provided a character vector to the "by" argument, but it also accepts a list of variables/expressions as shown here.
Once again, dot parenthesis and list parenthesis are identical, since dot parenthesis is simply an alias for list parenthesis. Dot parenthesis is used here for convenience.
By equals start_station from the previous slide and by equals dot parenthesis start_station are just different ways to obtain the same result.
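As a quick check of this equivalence, the sketch below runs the same grouping three ways and compares the results. It assumes a toy batrips table with hypothetical values.

```r
library(data.table)

# Toy stand-in for batrips (hypothetical values, for illustration only)
batrips <- data.table(
  start_station = c("A", "B", "A", "C", "B", "A")
)

# Three ways of writing the same grouping
ans_chr  <- batrips[, .N, by = "start_station"]      # character vector
ans_dot  <- batrips[, .N, by = .(start_station)]     # dot parenthesis
ans_list <- batrips[, .N, by = list(start_station)]  # list parenthesis

# Compare the results; both checks should report TRUE
same_chr_dot  <- isTRUE(all.equal(ans_chr, ans_dot))
same_dot_list <- isTRUE(all.equal(ans_dot, ans_list))
print(c(same_chr_dot, same_dot_list))
```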
4. The by argument
However, just as we saw in the earlier chapter, dot parenthesis has its advantages. It allows you to name the columns in the resulting data table on the fly. As shown here, by equals dot parenthesis start equals start_station results in the grouping column being renamed to "start". Similarly, dot parenthesis no_trips equals dot N in "j" results in the count column being named "no_trips". If no name had been provided, it would have been automatically named "N", as in the previous example.
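A minimal sketch of naming columns on the fly, again on a toy batrips table with hypothetical values:

```r
library(data.table)

# Toy stand-in for batrips (hypothetical values, for illustration only)
batrips <- data.table(
  start_station = c("A", "B", "A", "C", "B", "A")
)

# Name the grouping column "start" and the count column "no_trips"
ans <- batrips[, .(no_trips = .N), by = .(start = start_station)]
print(ans)
```

The result has the columns "start" and "no_trips" instead of "start_station" and "N".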
5. Expressions in by
By now you should be familiar with the fact that dot parenthesis notation, in addition to naming columns on the fly, also allows computations to be performed on columns on the fly.
Suppose you'd like to get the total trips from each start_station, but additionally for every month. How would you go about it? The month column does not exist in batrips. However, it can be extracted from the start_date column using the month() function available in the data table package. Since dot parenthesis allows expressions to be provided directly within it, by equals dot parenthesis start_station comma mon equals month(start_date) groups by the required columns. In addition, the computed grouping column is named "mon" in the result.
Since no name was provided for the expression in "j", the resulting column is automatically named "N".
With a single line of code, we've computed the total trips for each start station for each month.
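Here is a minimal sketch of grouping by an expression. The toy batrips table and its dates are hypothetical, and start_date is stored as a Date here for simplicity, which may differ from the column type in the real data.

```r
library(data.table)

# Toy stand-in for batrips (hypothetical values, for illustration only)
batrips <- data.table(
  start_station = c("A", "A", "A", "B", "B"),
  start_date    = as.Date(c("2014-01-05", "2014-01-20", "2014-03-02",
                            "2014-01-11", "2014-03-15"))
)

# Group by start station and by the month extracted from start_date.
# The computed grouping column is named "mon"; the count in "j" is
# left unnamed, so it is automatically named "N".
ans <- batrips[, .N, by = .(start_station, mon = month(start_date))]
print(ans)
```

Each row of the result is one combination of start station and month, with the number of trips in column "N".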
If you're curious, think about how you would get the total trips for every start station, but only for the month of March.
6. Let's practice!
Now it's time for you to use the "by" argument.