
Debugging in parallel

1. Debugging in parallel

Hi there! In this lesson we will learn how to debug parallel code.

2. What is debugging?

Debugging is the process of finding and fixing errors in code. Debugging parallel code is tricky: with multiple R sessions running simultaneously, it is harder to locate and fix errors.

3. Reading files in parallel

Let's look at an example. Here we have a list of file paths. Each file contains stock price data for a given year. We'd like to filter each CSV file for Tesla stocks.
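As a rough sketch, the input might look like the list below. The exact paths and years are placeholders for illustration, not the course's actual files.

# Hypothetical paths: one CSV of stock prices per year
file_list <- list(
  "data/2013.csv",
  "data/2014.csv",
  "data/2015.csv",
  "data/2016.csv",
  "data/2017.csv",
  "data/2018.csv"
)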

4. The filtering function

We write a function that takes one file path. It reads the CSV, filters the data, and writes the result back to the same file path.
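A minimal sketch of such a function, assuming each CSV has a Company column and we keep only the rows where it equals "Tesla":

library(dplyr)

filterCSV <- function(path) {
  # Read the CSV, keep only the Tesla rows, and overwrite the original file
  stocks <- read.csv(path)
  tesla  <- stocks %>% filter(Company == "Tesla")
  write.csv(tesla, path, row.names = FALSE)
}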

5. The parallel apply

To filter in parallel, we set up a cluster of four cores. Since the filterCSV() function uses dplyr, we load dplyr on each core. In a parLapply() call, we use the cluster to apply filterCSV() to all elements of file_list. Since the results are written out as CSVs, we assign the empty outputs to a dummy variable, and we stop the cluster once we're done. When we run this, we get an error.
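Put together, the parallel run might look roughly like this, assuming the file_list and filterCSV() objects from the previous slides:

library(parallel)

# Set up a cluster of four cores
cl <- makeCluster(4)

# filterCSV() uses dplyr, so load it on every worker
clusterEvalQ(cl, library(dplyr))

# Apply filterCSV() to every file path; the real output is written to disk,
# so the empty return values just go into a dummy variable
dummy <- parLapply(cl, file_list, filterCSV)

stopCluster(cl)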

6. The sequential run

The first step in debugging parallel code is to run the code sequentially. We select the first five elements of file_list. We then use lapply() to apply filterCSV() to these elements. This time the code runs without any errors. If we load the first file, we can see that the data has been filtered.
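Concretely, the sequential check might look like this; small_list is a name introduced here just for illustration:

# Run the function sequentially on a small subset first
small_list <- file_list[1:5]
dummy <- lapply(small_list, filterCSV)

# Inspect the first file to confirm the data was filtered
head(read.csv(small_list[[1]]))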

7. Locate the error

The second step is to locate the error. Usually, the error message can guide us. The last line of the error message says that the "Company" column is missing.

8. Locate the error

We go back to our filterCSV() function and copy its contents into a modified version, filterCSV_debug(). We then add a print statement that prints the file path and all the columns in the file. We want this on a single line, because many parallel processes might print at the same time. So we combine the file path and the column names with paste0(), supplying a comma to the collapse argument, so that everything is printed on one line.
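A sketch of the modified function, mirroring the earlier filterCSV() sketch with the single-line print added at the top:

filterCSV_debug <- function(path) {
  stocks <- read.csv(path)

  # Print the file path and its column names on one line, so output from
  # parallel workers doesn't get interleaved mid-message
  print(paste0(path, ": ", paste0(colnames(stocks), collapse = ",")))

  tesla <- stocks %>% filter(Company == "Tesla")
  write.csv(tesla, path, row.names = FALSE)
}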

9. Locate the error

We apply the new function, filterCSV_debug(), in parallel. We get the same error, but nothing was printed! This is because parallel processes cannot print directly to the main R session.

10. Locate the error

To ensure print messages are saved, we can provide a file name to the outfile argument of the makeCluster() function. Here we have named the file log.txt. This file will store all print output from the cluster. We then rerun our code with the filterCSV_debug() function.
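The cluster setup with logging might look like this; the file name log.txt matches the one used on the slide, and the rest mirrors the earlier sketch:

# Redirect worker output to a log file instead of discarding it
cl <- makeCluster(4, outfile = "log.txt")
clusterEvalQ(cl, library(dplyr))

dummy <- parLapply(cl, file_list, filterCSV_debug)

stopCluster(cl)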

11. Examining logs

Once this is done, we can open the log.txt file. Here we can see that the problem occurred in 2017.csv: that file is missing the Company column. While we could have inspected the CSV files manually, that becomes tedious when the input list is long. Storing debug messages is also good practice for later reference.

12. Debugging with foreach

We can use the same method with parallel foreach loops, as long as we set up the cluster with makeCluster().
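A rough equivalent with foreach and doParallel, assuming the same filterCSV_debug() and file_list as before:

library(doParallel)

# The outfile trick works the same way for a parallel foreach loop
cl <- makeCluster(4, outfile = "log.txt")
registerDoParallel(cl)

dummy <- foreach(path = file_list, .packages = "dplyr") %dopar% {
  filterCSV_debug(path)
}

stopCluster(cl)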

13. The good thing about furrr

With furrr, the messages are printed as the futures are resolved. So let's run this code as is.
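As a sketch, the furrr version could look like this, assuming dplyr is attached in the main session so the dependencies of filterCSV_debug() can be found; worker output is relayed back as each future resolves:

library(furrr)

# Plan a multisession run on four workers
plan(multisession, workers = 4)

dummy <- future_map(file_list, filterCSV_debug)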

14. The good thing about furrr

All messages are printed until we hit the error. Pretty neat!

15. The steps

In general, we first check that our function works sequentially on a small subset of the input. We then examine the error message to get an idea of what to investigate. Printing diagnostic messages, or logging them to a file, can help locate the error. We can then fix our code.

16. Let's practice!

Now let's do some debugging!