Fast data reading with fread()

1. Fast data reading with fread()

You may be familiar with the read.csv() or read.table() functions, which import flat files into R. In this chapter, you will use data.table's high-performance function fread() to import flat files. So what distinguishes fread() from the other file readers?

2. Blazing FAST!

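fread() is blazing FAST! It can import files in parallel on machines with multi-core processors. By default, fread() uses the number of threads reported by getDTthreads(), which was all available threads in older data.table releases and is half of the logical cores in recent ones. You can use the nThread argument to control the number of threads fread() uses.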
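As a minimal sketch of the thread control described here, the snippet below checks the current thread setting and then restricts a single fread() call to one thread. The inline CSV text and the column names are made up for illustration; they are not from the course.

```r
library(data.table)

# Hypothetical inline CSV text, used here instead of a real file
csv_text <- "id,value\n1,10.5\n2,20.25\n3,30.75\n"

# Number of threads data.table is currently configured to use
getDTthreads()

# Restrict this one call to a single thread via nThread
dt <- fread(csv_text, nThread = 1)
print(dt)
```

On a file this small the thread count makes no visible difference; the argument matters for large files.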

3. User-friendly

fread() is also very user-friendly. You can read local files from disk, files from URLs, and strings that need to be parsed, all using the same syntax! It can automatically guess column types, handle quotes, separators, and white space, and skip lines if necessary, which is useful when reading files that contain comments or metadata about the file. One thing to note, though, is that at the moment, date and datetime columns are read in as character columns. These columns can be converted later using the excellent fasttime package or the more recent anytime package.

4. Fast and friendly file reader

On this slide we show how the same dataset in three different formats can be imported into R using fread(). As you can see, fread() automatically detects whether the input is a file name, a URL, or a string. Note how the header is automatically detected as well. If there's no header, fread() automatically names the columns "V1", "V2", and so on, similar to the data.table() function. That's it. Everything else is taken care of automatically. If you would like to know what's going on under the hood, set the verbose argument to TRUE. In the rest of this chapter, we will use fread() to import data only through strings so you can see what the data looks like.
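The slide itself is not reproduced in this transcript, so here is a rough sketch of the same idea. The data, the temporary file, and the example.com URL are all hypothetical stand-ins; the URL call is left commented out because it needs a real address and network access.

```r
library(data.table)

# Hypothetical data, not the dataset from the slide
csv_text <- "name,score\nana,90\nbo,85\n"

# 1. From a string: input containing a newline is treated as the data itself
from_string <- fread(csv_text)

# 2. From a file on disk (a temporary file created just for this sketch)
path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "ana,90", "bo,85"), path)
from_file <- fread(path)

# 3. From a URL (placeholder address, not run)
# from_url <- fread("https://example.com/scores.csv")

# Without a header row, the columns get default names
no_header <- fread("1,2\n3,4\n")
names(no_header)

# verbose = TRUE prints the detection steps fread() goes through
invisible(fread(csv_text, verbose = TRUE))

unlink(path)
```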

5. nrows and skip arguments

Although fread() is clever enough to guess the column types and other details in most cases, there are times when you might appreciate finer control when importing your data. Let's see how these finer controls work. The arguments nrows and skip allow you to control which rows or lines of a file are imported. The nrows argument specifies the total number of rows to read, excluding the header row. It takes an integer as input. This is particularly useful when you want to have a quick look at a file by reading, say, just the first 100 rows instead of millions of rows. The skip argument also takes an integer as input and skips that many lines before attempting to parse the file. This is particularly useful for handling irregular files, for example, files with comments or metadata at the beginning. Even though fread() tries to handle this automatically, you might need to specify the skip argument manually in some special cases.
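To make the two arguments concrete, here is a small sketch using made-up inline data: nrows limits how many data rows are read, and an integer skip jumps over leading lines before parsing starts.

```r
library(data.table)

# Hypothetical clean data: a header plus four rows
clean <- "id,temp\n1,20.1\n2,20.4\n3,19.8\n4,21.0\n"

# Read only the first two data rows (the header is not counted)
first_two <- fread(clean, nrows = 2)
print(first_two)

# Hypothetical irregular data: two metadata lines before the header
raw <- paste0(
  "metadata: hypothetical export\n",
  "units: celsius\n",
  clean
)

# Skip the two metadata lines, then parse from the header line onward
skipped <- fread(raw, skip = 2)
print(skipped)
```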

6. More on nrows and skip arguments

The skip argument can also take a string as input. In this case, fread() searches for the first exact occurrence of that string and parses the file starting from the line on which that string occurs. Note how all text before the string "a,b" is skipped. And finally, you can use skip and nrows together to skip a few rows and read a specified number of rows from there, as shown here.
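Since the slide is not included in the transcript, the following is a sketch with made-up input that contains the "a,b" marker mentioned above. It shows skip as a string, then skip combined with nrows.

```r
library(data.table)

# Hypothetical input: two lines of notes, then a header "a,b", then data
raw <- "some notes\nmore notes\na,b\n1,2\n3,4\n5,6\n"

# Start parsing at the first line that contains "a,b"
from_marker <- fread(raw, skip = "a,b")
print(from_marker)

# Skip the two note lines, then read only two data rows after the header
combined <- fread(raw, skip = 2, nrows = 2)
print(combined)
```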

7. select and drop arguments

The arguments select and drop allow you to control which columns are imported. Since this is done while parsing, it is very efficient. Both arguments accept either a character vector of column names or a vector of column numbers. By default, all columns are parsed. The select argument parses the file only for the specified columns. The drop argument reads in all columns except the specified ones. Note that you cannot combine both arguments in the same function call.
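A short sketch of both arguments, using made-up inline data with hypothetical column names. The last call wraps the select-plus-drop combination in tryCatch() to show that fread() rejects it; the exact error message is not reproduced here because it may differ between versions.

```r
library(data.table)

# Hypothetical data with four columns
raw <- "id,name,score,grade\n1,ana,90,A\n2,bo,85,B\n"

# Keep only the named columns
by_name <- fread(raw, select = c("id", "score"))

# Keep the same columns, chosen by position
by_position <- fread(raw, select = c(1, 3))

# Read everything except the named columns
dropped <- fread(raw, drop = c("name", "grade"))

# Read everything except the second column
dropped_position <- fread(raw, drop = 2)

# Supplying both select and drop in one call is an error
both <- tryCatch(
  fread(raw, select = "id", drop = "name"),
  error = function(e) e
)
inherits(both, "error")
```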

8. Let's practice!

Now it's time for you to import data using fread()!
