1. Congratulations!
At this point, we've shown you how to write code that lets you scale your computations to larger data sets.
2. Split-Apply-Combine
The examples we showed you break the data into parts, compute on the parts, and combine the results. We called this approach split-compute-combine.
3. Split-Apply-Combine: Advantages
By splitting the data, we guarantee we are working with manageable subsets that won't overwhelm the available computing resources. The approach also makes our tasks easy to parallelize, since each part can be processed independently of the others. These operations can be executed sequentially, in parallel on a single machine, or across many machines in a cluster.
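As a minimal sketch of that idea (the data and parts below are made up for illustration, and it assumes a Unix-alike machine, since mclapply() does not fork on Windows), the same per-part function can run sequentially or in parallel without modification:

    # Assumed setup: four independent, made-up parts.
    library(parallel)
    parts <- split(1:100, rep(1:4, each = 25))
    res_seq <- lapply(parts, sum)                 # process sequentially
    res_par <- mclapply(parts, sum, mc.cores = 2) # process in parallel
    identical(res_seq, res_par)                   # same results either way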
4. Split-Apply-Combine: R
We showed you a few features of base R you may not have known: how to use the split() function to partition the row numbers of a data frame into parts, the Map() function to perform an operation on each of those parts, and the Reduce() function to combine the results.
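As a refresher, here is a minimal split-compute-combine sketch using those three functions; the built-in mtcars data set stands in for your own data:

    # Split: partition the row numbers of mtcars by cylinder count.
    parts <- split(seq_len(nrow(mtcars)), mtcars$cyl)
    # Compute: find the mean mpg within each part.
    means <- Map(function(rows) mean(mtcars$mpg[rows]), parts)
    # Combine: reduce the per-part results into a single vector.
    Reduce(c, means)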
We've also covered two tools for processing data using split-compute-combine.
5. bigmemory
The first tool we covered was bigmemory. If your data is big compared to the amount of RAM your computer has, and you can represent the data as a dense matrix, then bigmemory may be a good fit. It stores data on the disk, only moving it into RAM when needed. You access and manipulate the data in almost the same way as you would with a regular R matrix, and data are moved to and from the disk automatically.
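As a reminder of the pattern, here is a minimal sketch; the file names are hypothetical, and it assumes the bigmemory package is installed and the CSV holds only numeric columns:

    # Assumed input: a hypothetical numeric CSV, "mortgages.csv".
    library(bigmemory)
    # Create a file-backed big.matrix: the data stay on disk and are
    # paged into RAM only as needed.
    x <- read.big.matrix("mortgages.csv", header = TRUE, type = "double",
                         backingfile = "mortgages.bin",
                         descriptorfile = "mortgages.desc")
    # Index it like a regular R matrix.
    x[1:5, ]
    mean(x[, 1])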
6. iotools
The second tool we covered was iotools. The iotools package reads data from the disk, or another location, in contiguous chunks. A chunk is processed, an intermediate value is stored, the next chunk is retrieved, and the process continues until we've processed the entire file. While iotools doesn't allow you to directly retrieve an arbitrary value in the file without processing the entire file, it is more flexible than bigmemory: it can be used with data frames, it doesn't require you to store a single data structure on the disk, and it can process a greater variety of inputs.
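Here is a minimal sketch of that chunked pattern; the file name is hypothetical, and it assumes the iotools package is installed and the input is a headerless CSV with a single numeric column:

    # Assumed input: a hypothetical headerless CSV, "mortgages.csv".
    library(iotools)
    # Read the file in contiguous chunks; each chunk is parsed into a
    # data frame and reduced to a partial sum, and the partial sums
    # are combined with c() once every chunk has been processed.
    partial_sums <- chunk.apply("mortgages.csv",
      function(chunk) {
        df <- dstrsplit(chunk, col_types = c("numeric"), sep = ",")
        sum(df[[1]])
      },
      CH.MERGE = c)
    sum(partial_sums)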
7. Visualization
We've used both of these tools to create various tables and visualizations of the Federal Housing Finance Agency's mortgage data set.
8. Good luck!
Thank you for taking the course. Good luck working with Big Data in R.