1. readr: read_delim
Remember the states2.txt file from before,
2. states2.txt
that uses forward slashes as separators? We've already written
3. states2.txt
this customized read.table call for it. Let's now use readr's lower-level read_delim function to do exactly the same thing:
As usual, the first argument is the path to the file. Next, the delim argument specifies the character that separates fields within a record; it's the equivalent of the sep argument in read.table.
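To make the comparison concrete, here is a minimal sketch of the two equivalent calls. The file contents below are invented stand-ins for states2.txt: four slash-separated fields with a header line.

```r
library(readr)

# A small stand-in for states2.txt (contents invented for illustration):
# forward slashes separate the fields, the first line holds the column names.
path <- tempfile(fileext = ".txt")
writeLines(c("state/city/population/area",
             "South Dakota/Sioux Falls/186.2/190",
             "New York/New York City/8406.0/1214"), path)

# utils version: sep picks the delimiter, header marks the first row as names
df_utils <- read.table(path, sep = "/", header = TRUE,
                       stringsAsFactors = FALSE)

# readr version: delim plays the role of sep, and a tibble comes back
states <- read_delim(path, delim = "/")
```

Both calls produce the same four columns; only the return type differs, a data frame versus a tibble.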
The output corresponds to that of the read.table call, but the readr version again returns a tibble.
Let's compare the utils and the readr calls here. First off, we didn't have to specify something like header = TRUE: by default, read_delim expects the first row to contain the column names. This is controlled through the col_names argument. To control the types of the columns, readr uses the col_types argument, similar to the colClasses argument from utils. Let me dive into col_names first, and then talk some more about col_types.
4. col_names
col_names is TRUE by default. Suppose you have another version of the states file, states3.txt, without column names this time. Its first line is already a record.
5. col_names
Setting col_names to FALSE leads to automatic generation of column names, as in this example. You can also manually set col_names to a character vector. The names you pass are used as the column names, and the first line is read as a record, like here:
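Both options can be sketched as follows; the file contents are again invented stand-ins, this time mimicking states3.txt, which has no header line.

```r
library(readr)

# A stand-in for states3.txt (invented contents): no header line,
# the data starts right away.
path <- tempfile(fileext = ".txt")
writeLines(c("South Dakota/Sioux Falls/186.2/190",
             "New York/New York City/8406.0/1214"), path)

# col_names = FALSE: readr generates the names X1, X2, ... automatically
auto <- read_delim(path, delim = "/", col_names = FALSE)

# col_names as a character vector: your names are used,
# and the first line is read as a regular record
named <- read_delim(path, delim = "/",
                    col_names = c("state", "city", "population", "area"))
```
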
6. col_types
Next, there's also col_types, to control the column classes. If we just import states2.txt, the file with header names, like before, without specifying col_types, the column types are guessed from the first 30 rows of the input. The printout of the tibble shows us the class of each column, which is very practical. The first two columns are character, the third is double, and the fourth is integer.
You can also specify the column classes manually. In this call, we enforce state and city to be character and population and area to both be numeric. I used short string representations here: c stands for character, d for double or numeric, i for integer, and l for logical. The result is what we'd expect: the fourth column is now a double.
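A minimal sketch of the compact col_types string, again on invented stand-in data for states2.txt:

```r
library(readr)

# Stand-in for states2.txt (invented contents), with a header line.
path <- tempfile(fileext = ".txt")
writeLines(c("state/city/population/area",
             "South Dakota/Sioux Falls/186.2/190",
             "New York/New York City/8406.0/1214"), path)

# Without col_types, readr would guess: area would come out as integer.
# The compact string overrides the guess, one letter per column:
# c = character, d = double (numeric), i = integer, l = logical
states <- read_delim(path, delim = "/", col_types = "ccdd")
```

With "ccdd", the fourth column (area) is parsed as double instead of the guessed integer.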
Instead of c, d, i and l, you can also use an underscore to skip a column. A totally different way to control the column types is through collector functions. Although more complicated, they are more versatile. You'll learn more about this in the exercises.
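Here is a sketch of both ideas, the underscore and the collector functions, on the same invented stand-in file:

```r
library(readr)

# Stand-in for states2.txt (invented contents), with a header line.
path <- tempfile(fileext = ".txt")
writeLines(c("state/city/population/area",
             "South Dakota/Sioux Falls/186.2/190",
             "New York/New York City/8406.0/1214"), path)

# "_" drops the second column (city) entirely
no_city <- read_delim(path, delim = "/", col_types = "c_dd")

# Collector functions: the verbose, more versatile equivalent
# of the single-letter codes
with_collectors <- read_delim(path, delim = "/",
                              col_types = cols(state = col_character(),
                                               city = col_character(),
                                               population = col_double(),
                                               area = col_integer()))
```
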
7. skip and n_max
If you're working with huge flat files, say one million lines, you might be interested in handling the data in chunks of, for example, 50,000 lines. This keeps your work tractable and lets you easily follow the progress of your algorithms. In readr, you can do this with a combination of the skip and n_max arguments. Have a look at the output of this call:
We skipped two rows, and then read in three records. There's a problem, though! Because col_names is TRUE by default, the first row that is read gets used for the column names, but the actual header line was among the rows we skipped! We'll have to specify the column names manually this time, by setting col_names to a character vector. Now the first two rows, that is, the header line and the first observation, are skipped, and the next three observations are read in. Perfect.
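The corrected call can be sketched like this; the five-line stand-in file below is invented, standing in for a much larger one:

```r
library(readr)

# Invented stand-in file: a header line plus four records.
path <- tempfile(fileext = ".txt")
writeLines(c("state/city/population/area",
             "South Dakota/Sioux Falls/186.2/190",
             "New York/New York City/8406.0/1214",
             "Florida/Jacksonville/868.0/1935",
             "Texas/Houston/2196.0/1553"), path)

# skip = 2 drops the header line and the first record; since col_names is
# TRUE by default, the next record would be misread as the header.
# Supplying col_names manually avoids that and reads 3 records:
chunk <- read_delim(path, delim = "/", skip = 2, n_max = 3,
                    col_names = c("state", "city", "population", "area"))
```

For a genuinely huge file you would loop, increasing skip by the chunk size on every iteration.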
8. Let's practice!
Let's see your importing skills progress in the exercises. In the last video of this chapter, we'll talk about the amazing fread function from the data.table package!