Parallelization in R
1. Parallelization in R
In this lesson we will look at some practical considerations of parallelization in R.
2. A practical example
We start with an example. Consider this list of CSV file paths. Each file contains university rankings for a given country. We would like to read each file, create a new column indicating whether a university is ranked among the top 100, and write the new data frame back to the same file.
3. Add a column
One implementation could look like this: for each file we read the data, initialize a column, and loop over every row to fill it in. We then write the new data frame back to the same file. Putting aside all other optimizations, how do we parallelize this code? More importantly, which part do we parallelize?
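The steps just described can be sketched as below. The column name `world_rank` and the toy file contents are assumptions for illustration; the real exercise supplies its own `file_list` of country CSVs.

```r
# Toy input files so the sketch runs standalone
file_list <- replicate(2, tempfile(fileext = ".csv"))
for (f in file_list) {
  write.csv(data.frame(university = c("Uni A", "Uni B"),
                       world_rank = c(50, 250)),
            f, row.names = FALSE)
}

# Serial version: read, add the column row by row, overwrite the file
for (file in file_list) {
  univ <- read.csv(file)                    # read the data
  univ$top_100 <- FALSE                     # initialize the column
  for (i in seq_len(nrow(univ))) {          # loop over every row
    univ$top_100[i] <- univ$world_rank[i] <= 100
  }
  write.csv(univ, file, row.names = FALSE)  # overwrite the same file
}
```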
4. Profiling
Profiling can help us answer this question. Think of profiling as following a cooking recipe while noting the time and effort each step takes. To do this in R, we load the profvis package and supply our code, as it is, to the profvis() function. Notice the curly braces wrapping the code. This produces a detailed visualization, but we'll focus on the flame graph, which shows the memory usage and execution time of each line of code. Profiling results can change every time we call profvis(), so consider them an approximation. Here, most of the time is spent reading the data and looping over rows. Some lines, like writing the CSV, were too fast to be measured.
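A minimal profvis sketch of this workflow is shown below. The toy CSV and the `world_rank` column are assumptions, and the profvis::pause() call is an artificial stand-in for slow work so the sampler records something on such a small example.

```r
library(profvis)

# One toy CSV so the sketch runs standalone
file <- tempfile(fileext = ".csv")
write.csv(data.frame(world_rank = 1:200), file, row.names = FALSE)

# Wrap the code in curly braces inside profvis()
p <- profvis({
  univ <- read.csv(file)
  profvis::pause(0.2)            # artificial slow step for the sampler
  univ$top_100 <- FALSE
  for (i in seq_len(nrow(univ))) {
    univ$top_100[i] <- univ$world_rank[i] <= 100
  }
  write.csv(univ, file, row.names = FALSE)
})
p   # printing the object opens the interactive flame graph
```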
5. Let's parallelize
To cover the slowest parts, we parallelize the whole loop. Here is the original code. To parallelize it with parLapply(), we wrap it into a function called add_col(). This function takes a file path as its argument, loads the data, creates the column, and writes the data back to the file path. Suppose we have eight cores available. In practice, we want to reserve some resources for orchestration, so we create a cluster of six cores. We then apply add_col() to file_list in a parLapply() call using the cluster. add_col(), like every function in R, returns a value, here NULL. We capture this into a dummy variable, since the desired output is stored in the CSV files themselves.
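A sketch of the parallel version follows. The toy files and the `world_rank` column are assumptions; the video assumes eight cores and a six-core cluster, while this sketch uses two workers to stay light.

```r
library(parallel)

# Toy input files so the sketch runs standalone
file_list <- replicate(3, tempfile(fileext = ".csv"))
for (f in file_list) {
  write.csv(data.frame(world_rank = sample(1:500, 5)), f, row.names = FALSE)
}

# The loop body, wrapped into a function of the file path
add_col <- function(file) {
  univ <- read.csv(file)
  univ$top_100 <- univ$world_rank <= 100
  write.csv(univ, file, row.names = FALSE)
}

cl <- makeCluster(2)                         # six cores in the video
dummy <- parLapply(cl, file_list, add_col)   # results land in the CSVs
stopCluster(cl)
```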
6. Practical considerations: number of cores
In the previous example we had eight cores available and made a cluster of six. What if we run this code on a machine with only four cores? It might crash the system due to overload. Enter the detectCores() function, which returns the number of cores available on the current system. To make our code safer, we detect the number of cores and use all of them except two. Leaving one or two cores free for orchestration is usually enough.
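This safer pattern can be sketched as follows; the `max(1, ...)` guard is an added precaution for machines with very few cores.

```r
library(parallel)

n_cores <- detectCores()
# Leave two cores free for orchestration, but never drop below one worker
n_workers <- max(1, n_cores - 2)

cl <- makeCluster(n_workers)
# ... parLapply(cl, file_list, add_col) as before ...
stopCluster(cl)
```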
7. Practical considerations: cluster type
Another consideration is the type of cluster. We have not specified the type until now because we were using the default, a PSOCK cluster. This cluster starts a new R session on each core. Because these are new R sessions, they do not share the local workspace. Their key advantage is that they work on any operating system. The other type is a FORK cluster, which creates subprocesses from the current R session, a procedure called "forking". These subprocesses share their workspace, which means less communication between cores, so FORK clusters are faster than PSOCK clusters. Forking was developed for Unix systems (Linux and macOS) and is not supported on Windows. In this course we will focus mainly on PSOCK clusters since they can be used on any platform.
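The workspace difference can be demonstrated with a small experiment; the variable name `shared_value` is just an illustration.

```r
library(parallel)

shared_value <- 42  # lives in this session's workspace

# PSOCK workers are fresh R sessions: they do NOT see shared_value
cl <- makeCluster(2)  # type = "PSOCK" is the default
psock_sees <- unlist(clusterEvalQ(cl, exists("shared_value")))
stopCluster(cl)
psock_sees   # FALSE FALSE

# FORK workers are copies of this session and DO see it (Unix only)
if (.Platform$OS.type == "unix") {
  cl <- makeCluster(2, type = "FORK")
  fork_sees <- unlist(clusterEvalQ(cl, exists("shared_value")))
  stopCluster(cl)
  print(fork_sees)   # TRUE TRUE
}
```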
8. Let's exercise!
I have said "practical" a lot in this video, so let's exercise!