parLapply in real life
1. parLapply in real life
Hi everyone! In this lesson, we will look at some helper functions that facilitate the use of parLapply().

2. Let's meet the workers
We have been setting up our clusters like this. In effect, we are setting up independent instances of R in each of the four cores. These instances are called processes, and each process has a unique worker ID. What if we wanted to check that each process is active and working? The clusterEvalQ() function can help us do that. It takes two arguments: a cluster and an expression to evaluate in each process. Here we want to print a simple statement, extracting the IDs using the Sys.getpid() function. Note the curly braces around the multi-line R expression that we supply to clusterEvalQ().

3. Filtering data in parallel
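A minimal sketch of this pattern, assuming a four-core machine:

```r
library(parallel)

# Set up a cluster of four independent R processes
cl <- makeCluster(4)

# Evaluate an expression in each worker process; the curly braces
# let us supply a multi-line expression
clusterEvalQ(cl, {
  paste("Worker", Sys.getpid(), "is active")
})

# Always stop the cluster when done
stopCluster(cl)
```

The call returns a list with one element per worker, each reporting a different process ID.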
Let's see how this can be useful in a real-life application. Here we have a list of CSV file paths, and each CSV contains healthcare spending data for different geographical regions. We want to load each CSV and filter out the rows where the health expenditure is NA.

4. Filtering data in parallel
We write a function to perform the task on one CSV and then parallelize as usual. But when we run this, we get an error: it seems the pipe function from dplyr has not been loaded on each process, or worker. This is not surprising, since the default cluster type, PSOCK, does not share memory. Any packages we loaded before parallelizing are not loaded in the worker processes.

5. clusterEvalQ to the rescue
So what do we do? We make the cluster and supply the package-loading code to clusterEvalQ() to load the packages in the worker processes. And we get our results! We could also run any pre-processing code on each process with clusterEvalQ(); we just have to make sure to wrap multi-line code in curly braces before supplying it to clusterEvalQ().

6. Filtering with conditions
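Putting this together, here is a sketch of the working version. The file paths and the health-expenditure column name are hypothetical, since the lesson does not show them:

```r
library(parallel)

# Hypothetical CSV paths; each file holds spending data for one region
csv_paths <- c("region_north.csv", "region_south.csv", "region_east.csv")

# Process one CSV: drop rows where health expenditure is missing
# (the column name health_exp is an assumption)
filterCSV <- function(path) {
  read.csv(path) %>%
    filter(!is.na(health_exp))
}

cl <- makeCluster(4)

# PSOCK workers start empty, so load dplyr in every worker process
clusterEvalQ(cl, {
  library(dplyr)
})

results <- parLapply(cl, csv_paths, filterCSV)
stopCluster(cl)
```

Because filterCSV() is evaluated inside the workers, the pipe and filter() now resolve correctly there.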
Now, what if we wanted to filter the data from a certain year onwards? We could incorporate this into our function as an extra argument. And we define the year as, let's say, 2010.

7. Filtering with conditions
We set up the cluster as usual, and load the dplyr package for each process. But how do we tell the processes in the cluster that we have selected this year? Here we will use another function to interact with the worker processes: clusterExport(), as the name implies, can export variables to the cluster. We supply it the cluster we want to export to, and the name of the variable to export. This could also be a vector of multiple variable names as strings. The envir argument of clusterExport() specifies where to export the variables from. We supply the environment() function to this argument so that we export from the current environment. We proceed as usual, with one exception: in the parLapply() call, we specify the extra argument by supplying the variable selected_year to the min_year argument of filterCSV(). And then we stop the cluster.

8. Filtering with conditions
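A sketch of this step, again with hypothetical file paths and column names (health_exp, year):

```r
library(parallel)

csv_paths <- c("region_north.csv", "region_south.csv", "region_east.csv")

# filterCSV now takes an extra argument: the earliest year to keep
filterCSV <- function(path, min_year) {
  read.csv(path) %>%
    filter(!is.na(health_exp), year >= min_year)
}

selected_year <- 2010

cl <- makeCluster(4)
clusterEvalQ(cl, {
  library(dplyr)
})

# Export selected_year from the current environment to every worker
clusterExport(cl, "selected_year", envir = environment())

# Pass the exported variable to the named min_year argument
results <- parLapply(cl, csv_paths, filterCSV, min_year = selected_year)
stopCluster(cl)
```

Extra arguments after the function in parLapply() are forwarded to each call of filterCSV(), just as with lapply().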
When we run this code, we get our data filtered for the year 2010 and onwards.

9. Cluster hygiene checklist
So hopefully by now, a sort of parallelization checklist is forming in our heads. Once we have decided to parallelize, we must check the number of cores available and leave at least one core free. We set up the right type of cluster; PSOCK is a good default choice due to its compatibility. We load all the libraries needed to run our function in the independent processes. We export any extra variables. We supply these variables to the named arguments of our function. We then stop the cluster once we are done.

10. Let's practice!
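The whole checklist condenses into one skeleton, reusing the hypothetical filterCSV() and csv_paths from this lesson:

```r
library(parallel)

# 1. Check the available cores and leave at least one free
n_cores <- detectCores() - 1

# 2. Set up the right type of cluster; PSOCK is a compatible default
cl <- makeCluster(n_cores, type = "PSOCK")

# 3. Load the libraries the function needs in every worker process
clusterEvalQ(cl, {
  library(dplyr)
})

# 4. Export any extra variables from the current environment
selected_year <- 2010
clusterExport(cl, "selected_year", envir = environment())

# 5. Supply the exported variable to the named argument of the function
results <- parLapply(cl, csv_paths, filterCSV, min_year = selected_year)

# 6. Stop the cluster once we are done
stopCluster(cl)
```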
Now let's apply these concepts!