1. Adding and updating columns by reference
In this lesson, you will look at a distinctive feature of data table: reference semantics, which lets you add, update, and delete columns of a data table in place.
2. data.frame internals
Suppose you have a data frame "df" as shown here and you would like to change the second row of column "y" to 10 instead of 7.
You can do something like this.
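The slide's code is not reproduced in this transcript. A minimal sketch of such a replacement, assuming a small hypothetical "df" in which only the value 7 in the second row of "y" comes from the narration:

```r
# Hypothetical stand-in for the slide's data frame;
# the column "x" and the other values of "y" are made up for illustration
df <- data.frame(x = 1:3, y = c(5, 7, 9))

# Replace the second row of column "y" with 10
df$y[2] <- 10

print(df)
```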
3. data.frame internals
Now how does R handle this internally? In versions of R prior to 3.1.0, this operation resulted in deep copying of the entire data frame.
That is, the entire data frame "df" was copied in memory to a completely different location and assigned a temporary variable name, say "tmp". The new value was updated on "tmp" and the result was then assigned back to "df".
Now think about how much memory you will require if your data frame is 10GB. You will need at least 10GB more of free RAM just to replace a single value of a single column! This was obviously not memory efficient.
4. data.frame internals
Improvements were made in version 3.1.0 to not deep copy the entire data frame while updating columns.
This example data frame df has four columns. Say you would like to replace all "even" values with NA in the first two columns. You can do that as shown here.
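The slide's code is again missing from the transcript, so here is one possible way to write it. The column names "c" and "d" and all the values are assumptions made for illustration; only the four-column shape and the columns "a" and "b" come from the narration.

```r
# Hypothetical four-column data frame
df <- data.frame(a = 1:4, b = 5:8, c = 9:12, d = 13:16)

# Replace even values with NA in the first two columns only;
# columns "c" and "d" are left untouched
cols <- c("a", "b")
df[cols] <- lapply(df[cols], function(v) replace(v, v %% 2 == 0, NA))

print(df)
```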
From version 3.1.0 on, only columns "a" and "b" are deep copied instead of the entire data frame. This is a great improvement.
However, the columns being updated are still deep copied. Imagine your data frame is 10GB with 100 columns of similar size and you are updating 50 of them. You would still need roughly 5GB of extra space to update those columns.
Deep copying just the columns being updated is a welcome improvement, but it is by no means the most efficient way of updating columns.
So how does data table handle this efficiently?
5. data.table internals
The data table package does not deep copy objects or create any temporary variables. It simply updates the original data table by reference.
Since the original data table is directly updated, there is no need to assign the result back to a variable. It is therefore extremely fast and memory efficient.
data table provides a new operator, colon equal to (:=), to do this.
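As a minimal sketch of an update by reference, assuming the data.table package is installed and using a hypothetical table "dt" with made-up values:

```r
library(data.table)

# Hypothetical table for illustration
dt <- data.table(x = 1:3, y = c(5, 7, 9))

# Update row 2 of column "y" in place;
# the result is not assigned back with `dt <- ...`
dt[2, y := 10]

print(dt)
```

The call modifies "dt" itself, which is why no assignment appears on the left of the expression.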
6. LHS := RHS form
There are two ways of using data table's colon equal to operator.
The first is the left-hand-side colon equal to right-hand-side form. It takes a character vector of column names on the left-hand-side of the ":=" operator and a LIST of values on the right-hand-side of the ":=" operator, corresponding to each of the column names.
In the example shown, two columns are being added to the original data table batrips. The first one is TRUE if the duration is greater than 1 hour and the second is the weekday of each trip.
For convenience, you can skip the quotes around column names on the left-hand-side if a single column is added or updated.
Note that the result is not assigned to a new variable. The original data table batrips will now have these two columns added to it.
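The form described above might look like the following sketch. The toy "batrips" table, the column name "week_day", and the assumption that "duration" is measured in seconds are all illustrative and not taken from the slide. wday() is data.table's weekday helper, which numbers days starting from Sunday.

```r
library(data.table)

# Toy stand-in for batrips; "duration" is assumed to be in seconds
batrips <- data.table(
  duration   = c(600, 4000, 7200),
  start_date = as.Date(c("2014-01-01", "2014-01-02", "2014-01-05"))
)

# LHS := RHS form: character vector of names on the left,
# list of values on the right, one per name
batrips[, c("is_dur_gt_1hour", "week_day") :=
          list(duration > 3600, wday(start_date))]

# With a single column, the quotes on the left-hand side can be dropped;
# here the same column is simply recomputed
batrips[, is_dur_gt_1hour := duration > 3600]

print(batrips)
```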
7. Functional form
The second way of using the ":=" operator is the functional form. It takes arguments of the form "col1 = val1", "col2 = val2", and so on, passed to the function ":=()". Note that when you use operators as functions, they need to be wrapped inside backticks.
Assigning NULL to a column deletes that column by reference. Here, the "is_dur_gt_1hour" column is deleted. In addition, the "start_station" column is updated by reference to all upper case.
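A sketch of the functional form for this delete-and-update step, using a toy "batrips" table whose station names and other values are made up for illustration:

```r
library(data.table)

# Toy stand-in for batrips; values are hypothetical
batrips <- data.table(
  start_station   = c("Market at 4th", "2nd at Folsom"),
  duration        = c(600, 4000),
  is_dur_gt_1hour = c(FALSE, TRUE)
)

# Functional form: `:=` is called like a function, so it sits in backticks
batrips[, `:=`(
  is_dur_gt_1hour = NULL,                    # assigning NULL deletes the column
  start_station   = toupper(start_station)   # overwrite with upper-case names
)]

print(batrips)
```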
It is perfectly fine if you prefer to stick to one method over the other.
8. Let's practice!
Now it's your turn to use the colon equal to operator.