1. Introduction
bigmemory is a good solution for processing large datasets, but it has two limitations.
2. bigmemory
The data must all be stored on a single disk, and it must be representable as a matrix.
3. iotools
If your data is better represented as a data frame, or you want to split it across multiple machines, then you need an alternative solution.
To scale beyond the resource limits of a single machine, you can process the data in pieces. The iotools package provides a way to do this by processing data chunk by chunk. It can also be extended to entire clusters. In this chapter, you'll learn when and how to use this package to process large files.
4. Process one chunk at a time sequentially
When you split your dataset into pieces - sometimes called chunks - there are two ways to process each chunk: sequential processing and independent processing.
Sequential processing means processing each chunk one after the other. R sees only part of the data at a time, but can keep arbitrary information in the workspace between chunks. Code for sequential processing is typically easier to write than for independent processing, but the computations cannot be parallelized.
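A minimal base-R sketch of sequential processing (the file name "values.txt" and the chunk size of 25 are illustrative, not part of any package API): a running total is kept in the workspace between chunks, so R never holds the whole file at once.

```r
# Sequential processing: read the file a chunk of lines at a time, keeping
# a running total in the workspace between chunks.
writeLines(as.character(1:100), "values.txt")   # toy input file

con <- file("values.txt", "r")
total <- 0
repeat {
  chunk <- readLines(con, n = 25)      # next chunk of up to 25 lines
  if (length(chunk) == 0) break        # end of file reached
  total <- total + sum(as.numeric(chunk))
}
close(con)
total   # 5050, the same as sum(1:100)
```

Because each chunk updates state left behind by the previous one, the loop iterations cannot be reordered or run in parallel.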
5. Process each chunk independently
Independent processing allows operations to be performed in parallel by arbitrarily many cores or machines, but it requires an additional step: the final result has to be combined from the individual results of each chunk at the end. Hence this approach fits the split-compute-combine pattern and is similar to map-reduce.
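The same total can be computed by the independent approach. The sketch below uses base R's parallel package rather than iotools; the chunking scheme is illustrative. Each chunk is summed on its own, and only the combine step sees all the partial results.

```r
library(parallel)

# Independent processing: each chunk can be handled by a different core,
# then the per-chunk results are combined at the end.
x <- as.numeric(1:1000000)
chunks <- split(x, rep(1:4, each = 250000))   # four equal chunks

# Compute step: one sum per chunk, in parallel where the OS allows it
# (mclapply forking is not available on Windows, so fall back to one core).
cores <- if (.Platform$OS.type == "windows") 1L else 2L
partial_sums <- mclapply(chunks, sum, mc.cores = cores)

# Combine step: fold the per-chunk results into the final answer.
total <- Reduce(`+`, partial_sums)
total   # 500000500000, the same as sum(x)
```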
6. Mapping and Reducing for More Complex Operations
Some operations can be applied trivially to chunks. For example, computing the maximum of each chunk - and then the maximum of the chunk results - yields the maximum of the entire dataset.
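The maximum case can be checked in a couple of lines of base R (the values and chunking are made up for illustration):

```r
x <- c(3, 41, 8, 15, 2, 27)
chunks <- split(x, rep(1:2, each = 3))   # two chunks of three values each

chunk_max <- sapply(chunks, max)   # maximum within each chunk
max(chunk_max)                     # 41, identical to max(x)
```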
Calculating the mean of a vector is a little trickier. If the chunk sizes differ, taking the mean of the chunk means will not, in general, give the mean of the full vector. However, if for each chunk you record both the sum of the values and the size of the chunk, then you can find the mean by dividing the sum of the chunk sums by the sum of the chunk sizes.
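A tiny base-R example (with made-up values) shows both the pitfall and the fix: with unequal chunk sizes, the mean of the chunk means is wrong, while the sums-and-sizes approach recovers the true mean.

```r
chunks <- list(c(1, 2, 3), c(10))   # chunks of unequal size
x <- unlist(chunks)                 # full vector; its mean is 16 / 4 = 4

mean_of_means <- mean(sapply(chunks, mean))   # (2 + 10) / 2 = 6 -- wrong

sums  <- sapply(chunks, sum)                  # per-chunk sums: 6 and 10
sizes <- sapply(chunks, length)               # per-chunk sizes: 3 and 1
chunked_mean <- sum(sums) / sum(sizes)        # 16 / 4 = 4 -- correct
```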
7. Not all things fit into Split-Compute-Combine
Finally, there are some things you can't easily compute using split-compute-combine. A good example is the median of a vector. If you take the median of the chunk medians, there is no guarantee that the result is the median of the entire vector. You also can't compute the median by keeping a small amount of extra information per chunk, as you did with the mean. The problem is that, to calculate the median, you have to sort the entire vector and take the middle value. Sorting is not an operation that can be performed in the split-compute-combine framework alone.
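A small counterexample (with made-up values and deliberately unequal chunk sizes) makes the failure concrete:

```r
chunks <- list(c(1, 1, 1), c(2, 3), c(4, 5))
x <- unlist(chunks)   # sorted: 1 1 1 2 3 4 5, so the true median is 2

median_of_medians <- median(sapply(chunks, median))   # medians 1, 2.5, 4.5 -> 2.5
median(x)                                             # 2: the two disagree
```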
8. However ...
Fortunately most regression routines, including the linear and generalized linear models, can be written in terms of split-compute-combine.
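To see why linear regression decomposes, note that the normal equations need only the crossproducts X'X and X'y, and both are sums over rows, so each chunk can contribute its own crossproducts. The sketch below is a base-R illustration of that idea on simulated data, not the iotools or lm internals; the chunking and coefficients are made up.

```r
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
y  <- 2 + 3 * x1 + rnorm(n)   # simulated response
X  <- cbind(1, x1)            # design matrix with an intercept column

# Split the row indices into four chunks.
idx <- split(seq_len(n), rep(1:4, each = 250))

# Compute step: each chunk contributes its crossproducts.
xtx <- matrix(0, 2, 2)
xty <- numeric(2)
for (i in idx) {
  xtx <- xtx + crossprod(X[i, , drop = FALSE])
  xty <- xty + crossprod(X[i, , drop = FALSE], y[i])
}

# Combine step: solve the accumulated normal equations.
beta_chunked <- solve(xtx, xty)
beta_lm <- coef(lm(y ~ x1))   # fit on the full data for comparison
all.equal(as.numeric(beta_chunked), as.numeric(beta_lm))   # should be TRUE
```

For ill-conditioned problems the normal equations are numerically worse than the QR decomposition lm uses, but the chunk-wise accumulation itself is the same idea packages like biglm build on.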
In this chapter you'll look at how to perform these calculations on sequential chunks of large files, without creating auxiliary files as you did with bigmemory.
9. Let's practice!
Let's start by doing this in base R.