1. The parallel package - parSapply
R has a number of apply functions.
2. The apply family
In this video we'll look at sapply, but lapply, where we apply a function to each element of a list, is the same idea.
3. The sapply() function
The function sapply is just another way of writing for loops. For example, at each iteration of this loop we call the simulate function. This can easily be rewritten using sapply. The idea is that we are applying a function to each value of the vector. In this case, the vector is the sequence one to ten and the function is simulate. In general there is little difference in speed between using sapply and a standard loop. What's neat is that to run the code in parallel we simply substitute parSapply for sapply.
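The loop-to-sapply rewrite can be sketched like this; the `simulate()` function here is a hypothetical stand-in, since the original isn't shown.

```r
# Hypothetical stand-in for the simulate() function mentioned above
simulate <- function(i) mean(rnorm(1000))

# A standard for loop: call simulate() at each iteration
results_loop <- numeric(10)
for (i in 1:10) {
  results_loop[i] <- simulate(i)
}

# The same computation: apply simulate() to each value of 1:10
results <- sapply(1:10, simulate)
```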
4. Switching to parSapply()
Basically, it's the same routine as using parApply. You load the package; create the cluster; change sapply to parSapply; close the cluster.
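Those four steps can be sketched as follows; the `simulate()` function and the choice of two cores are illustrative assumptions.

```r
# 1. Load the package
library(parallel)

# Hypothetical function to run in parallel
simulate <- function(i) mean(rnorm(1000))

# 2. Create the cluster (here, two cores)
cl <- makeCluster(2)

# 3. Change sapply() to parSapply(); the cluster object is the first argument
results <- parSapply(cl, 1:10, simulate)

# 4. Close the cluster
stopCluster(cl)
```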
Nae bother.
5. Example: Pokemon battles
Let's have a concrete example. During your weekly Pokemon battles, you notice that there appears to be a positive relationship between defense and attack. A quick scatter plot and the resulting correlation value confirm this suspicion. Getting a handle on the uncertainty of the correlation estimate is a bit tricky. One easy solution is to use bootstrapping.
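A quick sketch of that check, using a synthetic stand-in for the Pokemon data; the data frame and the column names `attack` and `defense` are assumptions for illustration.

```r
# Synthetic stand-in for the Pokemon data set (illustration only)
set.seed(1)
pokemon <- data.frame(defense = rnorm(100, mean = 70, sd = 15))
pokemon$attack <- 0.6 * pokemon$defense + rnorm(100, mean = 30, sd = 10)

# Scatter plot of the relationship
plot(pokemon$defense, pokemon$attack)

# Point estimate of the correlation
cor_estimate <- cor(pokemon$attack, pokemon$defense)
```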
6. Bootstrapping
The idea behind the bootstrap algorithm is that in a perfect world, we would just go and get another sample from the population. However, in practice this isn't possible. So instead of sampling from the population, you re-sample using your original data set. Essentially there are only two steps. First, sample from the original data set with replacement; note the key word: replacement. This means a data point from the original data set can appear multiple times in your new sample. So if your original data set was of size one hundred, your new bootstrapped sample would also be of size one hundred. Second, calculate the correlation coefficient of the new data set.
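The two steps can be sketched as follows; the `pokemon` data frame and its columns are assumed stand-ins.

```r
# Synthetic stand-in for the Pokemon data set (illustration only)
set.seed(1)
pokemon <- data.frame(defense = rnorm(100), attack = rnorm(100))

# Step 1: resample row indices with replacement; the bootstrapped
# sample has the same size as the original data set
rows <- sample(nrow(pokemon), replace = TRUE)
boot_data <- pokemon[rows, ]

# Step 2: calculate the correlation coefficient of the new data set
boot_cor <- cor(boot_data$attack, boot_data$defense)
```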
These two steps are repeated multiple times. The distribution of correlation values gives us a measure of uncertainty about the correlation statistic. To run in parallel, you begin by creating a function
7. A single bootstrap
that creates a single bootstrap. This function has a single argument: the Pokemon data set. You can wrap this function with sapply, and hence with parSapply. Let's switch to parallel.
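A minimal sketch of such a function, wrapped with sapply; the function name `bootstrap` and the synthetic data are assumptions.

```r
# Hypothetical helper: one bootstrap replicate of the correlation
bootstrap <- function(data) {
  rows <- sample(nrow(data), replace = TRUE)
  cor(data$attack[rows], data$defense[rows])
}

# Synthetic stand-in for the Pokemon data set (illustration only)
set.seed(1)
pokemon <- data.frame(defense = rnorm(100), attack = rnorm(100))

# Repeat the single bootstrap many times with sapply()
cor_values <- sapply(1:500, function(i) bootstrap(pokemon))
```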
8. Converting to parallel
So you begin by loading the parallel package and creating a cluster object. Then we have an additional step involving clusterExport. This is where you explicitly export functions and data sets to the workers. You need this step because makeCluster doesn't copy all objects by default, for efficiency reasons.
After exporting, you change sapply to parSapply and shut down the cluster. The obvious question is: was converting this code to run in parallel worth the effort?
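Putting the conversion together, a sketch might look like this; the `bootstrap` helper, the synthetic data, and the two-core cluster are illustrative assumptions.

```r
library(parallel)

# Hypothetical helper: one bootstrap replicate of the correlation
bootstrap <- function(data) {
  rows <- sample(nrow(data), replace = TRUE)
  cor(data$attack[rows], data$defense[rows])
}

# Synthetic stand-in for the Pokemon data set (illustration only)
set.seed(1)
pokemon <- data.frame(defense = rnorm(100), attack = rnorm(100))

# Create the cluster
cl <- makeCluster(2)

# Explicitly export the function and data set to the workers
clusterExport(cl, c("bootstrap", "pokemon"))

# Change sapply() to parSapply()
cor_values <- parSapply(cl, 1:500, function(i) bootstrap(pokemon))

# Shut down the cluster
stopCluster(cl)
```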
9. Timings
The timing graph shows the relative speed for running the processes in parallel compared to the single core version.
The blue line is the relative time when using a single core. When the number of bootstraps is less than one hundred, it's quicker to stick with one core.
When you carry out more than one hundred bootstraps, the computation time outweighs the extra overhead of CPU communication and moving to parallel is worthwhile.
This graph is typical. It's not always faster to run code in parallel. We need to take a step back and determine if it's worthwhile for our particular problem. However, the cost of trying things in parallel is relatively low.
10. Let's practice!
In the exercises, we'll investigate this trade-off further.