chunk.apply
1. chunk.apply
So far you have been iterating over data by writing an explicit loop that reads, parses, and computes, much like a for loop.

2. chunk.apply()
In this section you will use a new function, chunk.apply(), that effectively moves us from using a for loop to using an apply()-style function on the chunks. It abstracts the looping process, defines a way to collect results, and enables parallel execution. The iotools package is the basis for the hmr package, which allows you to process data on the Apache Hadoop infrastructure. These packages are used at places like AT&T Labs to process hundreds of terabytes of data.

3. mstrsplit() reads chunks as matrices
Assume we have a file, foo.csv, with 3 columns containing numeric values, and we want the sum of each column. The code in this slide shows how to process this file in chunks using chunk.apply(). The first argument is the file to read from. The second argument is a function with one argument, chunk. Here we turn the chunk into a matrix using the mstrsplit() function. The function then processes the chunk by taking the column sums and keeps the per-chunk column sums as an intermediate result. chunk.apply() returns a matrix where each row corresponds to a chunk and each column is that chunk's sum for one of foo.csv's 3 columns. You can get the total sum of each column with another call to colSums().
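The slide's code isn't reproduced in this transcript, so here is a minimal sketch of the pattern just described, assuming a comma-separated foo.csv with no header and 3 numeric columns:

```r
library(iotools)

# Per-chunk function: parse the raw chunk into a numeric matrix and
# return its column sums as the intermediate result.
chunk_sums <- chunk.apply(
  "foo.csv",
  function(chunk) {
    m <- mstrsplit(chunk, sep = ",", type = "numeric")
    colSums(m)
  },
  CH.MERGE = rbind  # stack the per-chunk sums: one row per chunk
)

# Total sum of each of the 3 columns across all chunks.
colSums(chunk_sums)
```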
4. dstrsplit() reads chunks as data frames

In the previous example, we parsed each chunk inside the processing function as a matrix using mstrsplit(). This is fine when we are reading rectangular data where every column has the same type. When the types differ, we might prefer to read the data in as a data frame. You can do this either by reading a chunk as a matrix and then converting it to a data frame, or by using the dstrsplit() function. This function takes a chunk, just like mstrsplit(), and produces a data frame with column types you specify. Moreover, it lets you pick and choose subsets of fields from the data very efficiently.
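As a sketch of that idea (again, the slide's code isn't shown here), suppose foo.csv instead holds an integer column, a numeric column, and a character column; the column types and the first-column sum below are illustrative choices, not from the video:

```r
library(iotools)

first_col_totals <- chunk.apply(
  "foo.csv",
  function(chunk) {
    # Parse the chunk into a data frame with explicit column types.
    df <- dstrsplit(chunk,
                    col_types = c("integer", "numeric", "character"),
                    sep = ",")
    # Work with the data frame as usual, e.g. sum the first column.
    sum(df[[1]])
  },
  CH.MERGE = c  # collect one number per chunk into a vector
)

sum(first_col_totals)
```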
5. Parallelizing chunk.apply()

The chunk.apply() function also has a parallel option to process data more quickly. On Unix, when the parallel option is set to a value greater than one, multiple processes read and process the data at the same time, thereby reducing the execution time.
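A sketch of the parallel version, assuming a platform that supports forking (Linux/macOS; this option has no effect on Windows). Depending on your iotools version, the argument may be spelled parallel or CH.PARALLEL; check ?chunk.apply:

```r
library(iotools)

chunk_sums <- chunk.apply(
  "foo.csv",
  function(chunk) {
    colSums(mstrsplit(chunk, sep = ",", type = "numeric"))
  },
  CH.MERGE = rbind,
  parallel = 4  # read and process chunks with up to 4 worker processes
)

colSums(chunk_sums)
```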
6. Note about parallelization

It should be noted that increasing the number of processors won't always speed up your code; doubling the number of processors won't necessarily halve the execution time. There are usually diminishing returns when you add processors on a single machine to a calculation.
7. Let's practice!

Now that you've seen how to read in matrices and data frames, and how to parallelize your chunk.apply() code, let's try some exercises.