
A first look at iotools: Importing data

1. A first look at iotools: Importing data

2. Chunk-wise processing

The basic components of chunk-wise processing are loading a piece of the data, converting it to native objects that we can compute on, performing the desired computation, and storing the results. This sequence is repeated until you have processed all of the data. In this chapter we will discuss each of these pieces and show how the iotools package simplifies and speeds up the process.

3. Importing data

One often overlooked aspect of dealing with large data is the import of the data itself. In real applications, it often takes more time to load data than to process it. There are two contributing factors: the data has to be retrieved from disk, which is a relatively slow operation, and the data has to be converted from its raw form (typically text) into native objects (vectors, matrices, data frames). In chunk-wise processing, you load a chunk of the data, process it, keep or store the results, and discard the chunk. This means you typically cannot reuse the loaded data later, since you keep processing different chunks, so efficient functions for loading data are essential.

4. Importing data using iotools

The iotools package provides a modular approach where the physical loading of data and the parsing of input into R objects are separated for better flexibility and performance. R provides raw vectors as a way to handle input data without a performance penalty. This means we can separate the method by which we obtain a chunk from the process of deriving data objects from it.

5. iotools: Importing data

There are two main functions for reading data: readAsRaw() reads the entire content, while read.chunk() reads only up to a pre-defined amount of the data into memory. Both return a raw vector that is ready to be parsed into R objects.
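As a sketch of the two reading functions, the snippet below writes a small sample file to a temporary path (a stand-in for a real data file) and reads it back both ways; the "|"-separated layout is just an assumption for the example.

```r
library(iotools)

# Create a small sample file to stand in for a real data source.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1|a", "2|b", "3|c"), tmp)

# readAsRaw(): read the entire content into a single raw vector.
whole <- readAsRaw(tmp)
is.raw(whole)  # TRUE

# read.chunk(): read at most max.size bytes from a chunk reader,
# ending on a complete record boundary.
con    <- file(tmp, "rb")
reader <- chunk.reader(con)
piece  <- read.chunk(reader, max.size = 1024)
is.raw(piece)  # TRUE
close(con)
```

In both cases the result is an unparsed raw vector, which is what keeps the reading step fast: no conversion work happens until you choose to parse.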

6. iotools: Parsing data

Parsing text-formatted files can be done using the mstrsplit() function for reading matrices and dstrsplit() for reading data frames. Both are optimized for speed and efficiency.

7. iotools: Loading and parsing data

Although this design was chosen specifically to support chunk-wise processing, we can benefit from this approach even if we want to load the dataset in its entirety. For example, read.delim.raw() is a fast replacement for the read.delim() function, combining readAsRaw() and dstrsplit().
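A short sketch of that whole-file path, using a temporary tab-separated file with a header row as a stand-in for real data:

```r
library(iotools)

# Sample tab-separated file with a header row.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", "1\t3.5", "2\t4.25"), tmp)

# read.delim.raw() reads the file as a raw vector (readAsRaw) and
# parses it into a data frame (dstrsplit) in one call.
df <- read.delim.raw(tmp, header = TRUE, sep = "\t")
```

The call mirrors read.delim()'s familiar header and sep arguments, so it can be dropped into existing code with minimal changes.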

8. Chunk-wise processing

Processing contiguous chunks means you don't have to go through the entire dataset ahead of time to partition it. Splitting simply means reading the next set of rows from the data source and there is no intermediate data structure (like the list returned from the split() function) to store and manage.

9. File connections

In practice, you open a connection to the data, read a chunk, parse it into R objects, compute on it, and keep or store the result. You repeat this process until you reach the end of the data. You can control the chunk size, which limits the amount of data processed in one iteration. R is very good at vector and matrix operations, so processing an entire chunk at a time keeps this approach efficient.
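The loop just described can be sketched as follows; the generated file, its "|"-separated two-column layout, and the running sum are hypothetical stand-ins for a real data source and computation.

```r
library(iotools)

# Stand-in data source: 100 records of "id|value".
tmp <- tempfile()
writeLines(sprintf("%d|%f", 1:100, runif(100)), tmp)

con    <- file(tmp, "rb")
reader <- chunk.reader(con)
total  <- 0

repeat {
  # Read at most max.size bytes, ending on a record boundary.
  chunk <- read.chunk(reader, max.size = 4096)
  if (length(chunk) == 0) break   # reached the end of the data

  # Parse the chunk, then compute on it with a vectorized operation.
  d <- dstrsplit(chunk, col_types = c("integer", "numeric"), sep = "|")
  total <- total + sum(d[[2]])    # keep only the result, discard the chunk
}
close(con)
```

Each iteration touches only one chunk, so memory use is bounded by max.size rather than by the size of the file.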

10. Let's practice!

Now that you've seen how to read and parse data, let's put it into practice.