1. Computations in j using .SD
In this lesson, you will look at another special symbol, dot-SD, which stands for Subset of Data. It makes computations even easier and more powerful, as we will see in a moment.
2. Subset of Data, .SD
dot-SD is an extremely powerful symbol. Understanding it can help you do complex data wrangling in a straightforward manner. As mentioned earlier, dot-SD stands for 'Subset of Data'. When grouping, it holds the intermediate data corresponding to each group while results for that group are being computed. It contains all the columns except the grouping column itself, by default. Let's see dot-SD in action using this data table x.
3. Subset of Data, .SD
You can see that dot-SD contains the "val1" and "val2" columns for each unique value of "id". It contains all the rows of those columns corresponding to the group for which results are being computed. The "id" column itself is not included. Also note that these groups are data tables! This means we can subset rows, select columns, and run computations on the intermediate data table within each group! This is why dot-SD is very powerful.
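The slide output is not part of this transcript, so here is a minimal sketch of what it might look like. The contents of `x` are an assumption (the transcript only names the columns "id", "val1" and "val2"), and `sd_info` is a helper name introduced for illustration.

```r
library(data.table)

# Hypothetical stand-in for the data table x shown on the slide
x <- data.table(id = c(1, 1, 2, 2, 1, 1), val1 = 1:6, val2 = letters[6:1])

# Print the intermediate data table that .SD holds for each group
x[, print(.SD), by = id]

# Summarise what .SD contains per group: its column names and row count
sd_info <- x[, .(cols = paste(names(.SD), collapse = ", "), n_rows = nrow(.SD)), by = id]
sd_info
```

The summary should list only "val1" and "val2" for every group, confirming that the grouping column is left out of dot-SD.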
4. Subset of Data, .SD
So how can you use dot-SD to find the first row for each id?
When grouping by the "id" column, you now know that dot-SD contains all the rows for the current group and all the columns except the grouping column, id. You also know that dot-SD is itself a data table. Therefore, dot-SD[1] in "j" returns the first row for each group.
5. Subset of Data, .SD
Similarly, you can use dot-SD in combination with dot-N to obtain the last row for each unique id. Since dot-N holds the number of rows in the current group, dot-SD[dot-N] picks the last one.
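As a rough sketch of the first-row and last-row idioms just described, again using an assumed toy version of `x` (the values are illustrative, not taken from the slides):

```r
library(data.table)

# Hypothetical stand-in for the data table x shown on the slide
x <- data.table(id = c(1, 1, 2, 2, 1, 1), val1 = 1:6, val2 = letters[6:1])

# First row of each group: .SD is a data table, so .SD[1] subsets its first row
first_rows <- x[, .SD[1], by = id]

# Last row of each group: .N is the number of rows in the current group
last_rows <- x[, .SD[.N], by = id]

first_rows
last_rows
```

Both results have one row per id, with the grouping column followed by "val1" and "val2".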
6. .SDcols
In the previous examples, dot-SD[1] and dot-SD[dot-N] returned the first and last rows for each "id". However, they returned ALL the columns, "val1" and "val2". What if you would like to return just "val1"? In other words, how can you control the columns that are available to dot-SD?
Using dot-SDcols! It takes a character vector of column names that determines which columns are included in dot-SD. In the first example here, dot-SD[1] returns the first row for every start station, with ALL the columns.
In the second example, for every start station, only the first row is returned, and only with the trip_id and duration columns alongside the grouping column.
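The two examples referred to here are not included in the transcript. A minimal sketch of what they might look like follows; the `trips` table and its values are hypothetical stand-ins for the bike-trip data, and only the column names `start_station`, `trip_id` and `duration` come from the narration (`end_station` is an assumed extra column).

```r
library(data.table)

# Hypothetical toy table standing in for the trips data used on the slide
trips <- data.table(
  trip_id       = 101:105,
  duration      = c(300L, 420L, 180L, 600L, 240L),
  start_station = c("A", "B", "A", "C", "B"),
  end_station   = c("B", "C", "C", "A", "A")
)

# First example: first row per start station, with all non-grouping columns
all_cols <- trips[, .SD[1], by = start_station]

# Second example: restrict .SD to trip_id and duration via .SDcols
two_cols <- trips[, .SD[1], by = start_station,
                  .SDcols = c("trip_id", "duration")]

all_cols
two_cols
```

The grouping column always appears in the result; dot-SDcols only controls which of the remaining columns dot-SD carries.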
7. .SDcols
Of course, you can also prefix the character vector with a minus sign (-) or the not operator (!) to return ALL the columns EXCEPT the ones provided to dot-SDcols.
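A short sketch of the negated form, under the same assumptions as before: `trips` is a hypothetical toy table, and only the column names `start_station`, `trip_id` and `duration` come from the narration.

```r
library(data.table)

# Hypothetical toy table standing in for the trips data used on the slide
trips <- data.table(
  trip_id       = 101:105,
  duration      = c(300L, 420L, 180L, 600L, 240L),
  start_station = c("A", "B", "A", "C", "B"),
  end_station   = c("B", "C", "C", "A", "A")
)

# Not operator: keep every column in .SD except trip_id and duration
drop_with_not <- trips[, .SD[1], by = start_station,
                       .SDcols = !c("trip_id", "duration")]

# Minus sign: same exclusion, written with - instead of !
drop_with_minus <- trips[, .SD[1], by = start_station,
                         .SDcols = -c("trip_id", "duration")]

drop_with_not
drop_with_minus
```

In this toy table, both calls should leave only `end_station` next to the grouping column.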
8. Let's practice!
Time to put this into practice.