The parallel apply family

1. The parallel apply family

Up until now, we have focused on parLapply(). The parallel package has a lot more to offer!

2. Weight gain during pregnancy

Let's start with the US births dataset. Here we have a list of vectors, one for each state in the US. The elements of each vector represent the weight gained in pounds during pregnancy for each individual. We want to bootstrap a distribution for the average weight gained for each state.

3. FORKing with mclapply

We define our bootstrap function. We use the fork-based counterpart of parLapply(), called mclapply(). mclapply() takes the input list and the function. The number of cores is supplied to the mc.cores argument. That's it! Notice that no cluster object is needed here. It's fast, but it does not run on Windows, because it relies on forking the current R session.
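A minimal sketch of this pattern, using made-up weight-gain data (the state names, sample sizes, and distribution parameters are invented for illustration):

```r
library(parallel)

# Hypothetical weight-gain data: one vector per state
ls_weights <- list(
  AK = rnorm(100, mean = 30, sd = 10),
  AL = rnorm(150, mean = 28, sd = 12)
)

# Bootstrap a distribution of the mean weight gained
boot_dist <- function(x, n_boot = 1000) {
  replicate(n_boot, mean(sample(x, size = length(x), replace = TRUE)))
}

# Forking is unavailable on Windows, so fall back to one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# No cluster object needed: workers are forked from the current session
res <- mclapply(ls_weights, boot_dist, mc.cores = n_cores)
lengths(res)
```

Because the workers are forks, they already share the data and loaded packages of the parent session, which is why no clusterExport() step is needed.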

4. Balance the load with parLapplyLB

Here is another parLapply() cousin, parLapplyLB(). The LB stands for load balancing. For the weight gain data we used earlier, we see that the length of the vectors varies. This could be a good place to use the load balancing version.

5. Balance the load with parLapplyLB

Let's see how it performs against parLapply(). We reduce the execution time by about 5%. Not a whole lot, but the benefits could increase with larger differences in input sizes.
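The comparison can be sketched on deliberately uneven inputs (the vector sizes here are invented to exaggerate the imbalance; actual savings will vary by machine):

```r
library(parallel)

# Uneven workloads: vector lengths differ by orders of magnitude
ls_weights <- lapply(c(50, 5000, 200, 8000), rnorm)

boot_dist <- function(x, n_boot = 200) {
  replicate(n_boot, mean(sample(x, size = length(x), replace = TRUE)))
}

cl <- makeCluster(2)

# parLapply(): inputs are split into equal-sized chunks up front
t_static <- system.time(parLapply(cl, ls_weights, boot_dist))

# parLapplyLB(): each worker takes the next task as soon as it is free
t_lb <- system.time(res_lb <- parLapplyLB(cl, ls_weights, boot_dist))

stopCluster(cl)

t_static["elapsed"]
t_lb["elapsed"]
```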

6. Multiple arguments to loop over

Suppose we have the function add() that sums three values. We want to apply this function to each value of these vectors. Notice that the first two vectors are of length four, while the third is of length one. We want to iterate over values of all three vectors simultaneously.

7. clusterMap

For this purpose, let's meet clusterMap(). To clusterMap(), we supply the cluster, the function, and all inputs to loop over. And we get our sums. Notice that value3 was automatically recycled four times.
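The example can be sketched as follows (the vector values are invented, but the recycling behavior is exactly as described):

```r
library(parallel)

# add() sums three values
add <- function(x, y, z) x + y + z

value1 <- c(1, 2, 3, 4)
value2 <- c(10, 20, 30, 40)
value3 <- 100  # length one: recycled for all four iterations

cl <- makeCluster(2)

# clusterMap(): cluster first, then the function, then the inputs
res <- clusterMap(cl, add, value1, value2, value3)

stopCluster(cl)

unlist(res)  # 111 122 133 144
```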

8. clusterMap vs. parLapply

Note how clusterMap() is different from parLapply(), especially in the order of arguments. clusterMap() takes the cluster, the function, and then multiple inputs to loop over. Any static variables that are not looped over can also be supplied here, as they will be recycled. parLapply() takes the cluster, the single input to loop over, and then the function, which can be followed by any named static arguments.
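A side-by-side sketch of the two call signatures; scale_shift() is a hypothetical helper with one looped argument and two static ones:

```r
library(parallel)

# Hypothetical helper: x is looped over, mult and offset are static
scale_shift <- function(x, mult, offset) x * mult + offset

xs <- list(1, 2, 3)
cl <- makeCluster(2)

# parLapply(): cluster, single input, function, then named static arguments
r1 <- parLapply(cl, xs, scale_shift, mult = 10, offset = 1)

# clusterMap(): cluster, function, then inputs; the length-one static
# arguments are recycled to the length of xs
r2 <- clusterMap(cl, scale_shift, x = xs, mult = 10, offset = 1)

stopCluster(cl)

unlist(r1)  # 11 21 31
```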

9. Weight gained with clusterMap

Let's see an example. For our pregnancy weight gain data, what if we wanted to bootstrap the weight gained per baby born? Now we have two input lists: one containing the weight gained values, and the other the plurality values (the number of babies delivered in each birth). boot_dist() takes weight gained and plurality, calculates the ratio of weight gained per baby, and bootstraps it. After supplying the cluster and function to clusterMap(), we supply ls_weights to the weights argument and ls_plur to the pluralities argument. And we're done!
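A minimal sketch of this call, with made-up data for two states (the state names, weights, and pluralities are invented):

```r
library(parallel)

# Hypothetical weight-gain and plurality data for two states
ls_weights <- list(AK = c(30, 25, 40, 22), AL = c(28, 35, 31))
ls_plur    <- list(AK = c(1, 1, 2, 1),    AL = c(1, 2, 1))

# Ratio of weight gained per baby, then a bootstrap of its mean
boot_dist <- function(weights, pluralities, n_boot = 1000) {
  ratio <- weights / pluralities
  replicate(n_boot, mean(sample(ratio, size = length(ratio), replace = TRUE)))
}

cl <- makeCluster(2)

# Both input lists are looped over in lockstep, matched by position
res <- clusterMap(cl, boot_dist,
                  weights = ls_weights, pluralities = ls_plur)

stopCluster(cl)

lengths(res)  # 1000 bootstrap means per state
```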

10. Total number of births

Next, consider this matrix of total number of births in the US for a given year. Each column is a calendar month and each row corresponds to a state. We want to calculate the total number of births in the US for each calendar month. This is a column-wise sum.

11. Row and column operations in parallel

parCapply(), or parallel column apply, can help us. We start by creating a cluster. We supply this cluster to parCapply(), followed by the input matrix and the function to apply to each column. We get our total number of births by month. Before we stop this cluster, let's also get the total births for each state in this calendar year. This is a row-wise sum, so here we use parRapply(). We use the same cluster, input, and function, but this time we get row sums. In general, for all the calculations shown in this video, the benefits of parallelization will only be noticeable for larger datasets.
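Both steps can be sketched with a toy births matrix (the counts are invented, and a real dataset would have a row for every state):

```r
library(parallel)

# Toy matrix: 2 states (rows) by 12 months (columns)
births <- matrix(1:24, nrow = 2,
                 dimnames = list(c("AK", "AL"), month.abb))

cl <- makeCluster(2)

# Column-wise sums: total births per calendar month
monthly_totals <- parCapply(cl, births, sum)

# Row-wise sums: total births per state over the year
state_totals <- parRapply(cl, births, sum)

stopCluster(cl)

sum(monthly_totals) == sum(state_totals)  # both partition the same total
```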

12. Let's practice!

Let's get practicing!