1. Filtering rows in a data.table
In this lesson, we will look at how to subset or filter rows from a data table using row numbers and logical expressions.
2. General form of data.table syntax
The general form of a data table is DT square brackets, i, comma, j, comma, by. It is read out loud as, "Take DT, filter rows in "i", then compute "j" grouped by "by". We will look at the first argument "i" here. The functionality is similar to data frames but more convenient and enhanced.
3. Row numbers
Rows can be filtered using row numbers as you would do in a data frame, except the first argument is always interpreted as a row operation irrespective of whether or not you specify a comma. For example, to subset the third and fourth rows, you can use the code as shown here.
If you want to exclude a set of rows, you can use the negative sign or the exclamation mark as shown here.
4. Special symbol .N
data table contains a few handy special symbols (or variables) that make many operations efficient. One such symbol is dot N which is an integer vector of length one. When used in the "i" argument, it returns the total number of rows in the data table. To get the last row, for example, you can simply do batrips square brackets dot N. Similarly, to get all but the last 10 rows, you can specify the "i" argument as 1 colon dot N minus 10. Since dot N is the total numbers of rows, you are essentially creating a continuous sequence of integers from 1 to the required row number.
5. Logical expressions (I)
You can also construct expressions resulting in a logical vector in the "i" argument. This returns only those rows that evaluate to TRUE. However, note how we refer to the columns in data table as if they are variables. This is because within the scope of data table, that is, within the two square brackets, columns are seen as if they are variables. This avoids unnecessary repetition of the dollar sign when referring to column names while constructing expressions and therefore avoids subtle and hard to track bugs. Also, the resulting expression is much cleaner and easier to read.
6. Logical expressions (II)
Here's another example. In this example, we filter for rows where start_terminal is 58 and end_terminal is not 65. Again, note the concise and clear syntax of data table.
7. Logical expressions (III)
As mentioned in the first video, data table is fast! It automatically creates an index (or key) of the columns used to subset the data so that any subsequent operations on that column are incredibly faster. You can have a look at all such indices using the indices() function.
Let's look at an example. First, we create a data table dt with 10 million rows. When we call indices() on dt, we can see that the result is NULL. Thus, at this point, the data table has no indices. When we perform a subset operation for the first time, we can see that it takes about 0-point-2 seconds. Now when we run indices of dt again, notice that x is an index. So when we perform the same operation again, it takes only about 0-point-002 seconds.
8. Let's practice!
It's time for you to practice subsetting data tables.