
Computing on columns the data.table way

1. Computing on columns the data.table way

In this lesson, we will see the advantages of being able to use column names as variables in "j", which makes data analysis operations intuitive and succinct.

2. Computing on columns

The reason "j" is extended to allow column names to be seen as variables is so that we can perform computations on columns directly in "j". For example, if you want to compute the mean of the "duration" column, all you need to do is write "mean(duration)" in "j". mean() returns a single value and, as you may recall from the last lesson, since "j" is not wrapped inside "list()", the result is a vector. Now compare this to the data frame way, where you first select the column and then pass the result to the mean() function as an argument. The data table way of computing directly on columns allows for clear, convenient and concise code, and it can be easily extended to calculate statistics on multiple columns, which you will see in the next video.
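The slide code is not part of this transcript, so here is a minimal sketch of the comparison. The table name `trips` and its values are made up for illustration; only the column names "duration" and "start_station" come from the lesson.

```r
library(data.table)

# Toy stand-in for the trips data (hypothetical values)
trips <- data.table(
  start_station = c("Japantown", "Townsend at 7th", "Japantown", "Market at 10th"),
  duration      = c(600, 900, 1200, 300)
)

# data.table way: refer to the column by name directly in j
mean_dt <- trips[, mean(duration)]

# Wrapping j in list() returns a data.table instead of a vector
mean_as_table <- trips[, list(mean_duration = mean(duration))]

# data.frame way: select the column first, then pass it to mean()
trips_df <- as.data.frame(trips)
mean_df  <- mean(trips_df[, "duration"])
```

Both approaches give the same number; the difference is that the data table version needs no separate column-selection step.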

3. Computing on rows and columns

You filtered rows using the "i" argument in the last chapter. Let's say you would like to compute the mean duration for those trips where the start station is "Japantown". You can now do this by combining the "i" and "j" arguments as shown here. First, you filter rows where start_station equals "Japantown" in "i" and then compute the mean duration in "j". This is possible because "j" is computed only on the rows returned by the filtering operation.
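A small sketch of combining "i" and "j", again using a made-up `trips` table (the name and values are assumptions, not the course data):

```r
library(data.table)

# Toy stand-in for the trips data (hypothetical values)
trips <- data.table(
  start_station = c("Japantown", "Townsend at 7th", "Japantown", "Market at 10th"),
  duration      = c(600, 900, 1200, 300)
)

# i keeps only the Japantown rows; j computes the mean on those rows
japantown_mean <- trips[start_station == "Japantown", mean(duration)]
```

Here only the two Japantown rows (durations 600 and 1200) contribute to the mean.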

4. Special symbol .N in j

Remember the special symbol dot N from the last chapter, which holds the number of rows in a data table and which you used in the "i" argument? You can use it in "j" too. Suppose you want to calculate the number of trips that started from Japantown. You can do this by filtering rows where start_station == "Japantown" and then specifying dot N in the "j" argument. Since "j" is calculated on the result of "i", you get the total number of rows in the filtered data table. In other words, it returns the total number of trips that were made from Japantown. To get an idea of how convenient and efficient the data table way is, let's compare this to the data frame equivalent. This code shows the most common way of performing the same operation. In this approach, you first return all the columns for the filtered data, only to compute the total number of rows. Imagine if your original data was 50GB! The filtering operation alone would take an incredible amount of memory and be very slow. In the data table version, no column is actually selected. The expression in "i" identifies the rows to be extracted. Looking at "j", it is clear that no columns are required, only the total number of rows filtered, and therefore it is very efficient in terms of both memory and run time.
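The two versions being compared might look like the sketch below. The `trips` table and its values are hypothetical, and the data frame line is one common way to write the operation, not necessarily the exact code on the slide.

```r
library(data.table)

# Toy stand-in for the trips data (hypothetical values)
trips <- data.table(
  start_station = c("Japantown", "Townsend at 7th", "Japantown", "Market at 10th"),
  duration      = c(600, 900, 1200, 300)
)

# data.table way: .N in j counts the rows that i selected
n_dt <- trips[start_station == "Japantown", .N]

# data.frame way: build the filtered data frame with all columns, then count rows
trips_df <- as.data.frame(trips)
n_df <- nrow(trips_df[trips_df$start_station == "Japantown", ])
```

On a table this small the difference is invisible; the point is that the data frame version materialises every column of the filtered rows just to count them.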

5. Let's practice!

Go ahead and practice computing in "j"!