Reproducibility in parallel

1. Reproducibility in parallel

Hello! In this lesson we will discuss reproducibility in parallel programming.

2. What is reproducibility?

Reproducibility is the property of code to consistently produce the same results from the same inputs, regardless of the computational environment. Reproducible code is easier to test because the expected output for a given input is known. This matters most for code that will be run repeatedly, such as code inside an application.

3. The customer lucky draw

Suppose we work for a multi-national telecommunications company. We have a list of customer IDs; each element contains IDs from one country. Every month our company holds a lucky draw to offer a paid vacation to one customer from each country. To be fair, our company wants the selection to be completely random. Let's code this in parallel.

4. The customer lucky draw

We write a function, lucky_draw(), that randomly selects a customer ID. Inside it, the sample() function randomly picks one element of the IDs vector. This random pick requires a random number generator, or RNG. We have arbitrarily chosen to set up a cluster of four workers. To make our results reproducible, we set a seed using the set.seed() function. Setting the seed fixes the starting state of the RNG. In a parLapply() call, we use the cluster to apply lucky_draw() to the customer IDs from each country. We run this to get the customer IDs of the winners.
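As a rough sketch, the code described on this slide might look like the following. The contents of customer_ids, the country names, and the seed value are made up for illustration; only lucky_draw(), sample(), set.seed(), and parLapply() come from the lesson.

```r
library(parallel)

# Hypothetical data: one vector of customer IDs per country
customer_ids <- list(
  country_a = 1001:1050,
  country_b = 2001:2050,
  country_c = 3001:3050
)

# Randomly pick one winner from a vector of IDs
lucky_draw <- function(ids) sample(ids, size = 1)

cl <- makeCluster(4)   # cluster of four workers
set.seed(1234)         # assumed seed; fixes the RNG state of the main session
winners <- parLapply(cl, customer_ids, lucky_draw)
stopCluster(cl)

winners
```

The result is a list with one winning ID per country.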

5. The reproducibility problem

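Suppose our colleagues want to check if the winners were selected fairly. They rerun the same code with the same inputs, but they end up with completely different results! The reason is that set.seed() only fixes the RNG state of the main R session. Each worker in the cluster is a separate R process with its own RNG, and those are left unseeded. As a result, we also have no way of testing whether the code is working as expected.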

6. Solution

Here is one possible solution. We make a cluster and use the clusterSetRNGStream() function to set a seed for the cluster we just created. This function sets up a separate stream of random numbers for each worker in the cluster. The seed can be any value; here we choose 1234. From there we proceed as before.
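A minimal sketch of this solution, assuming the same made-up customer_ids list and lucky_draw() function as in the lucky draw example, could be:

```r
library(parallel)

# Hypothetical data: one vector of customer IDs per country
customer_ids <- list(
  country_a = 1001:1050,
  country_b = 2001:2050,
  country_c = 3001:3050
)

lucky_draw <- function(ids) sample(ids, size = 1)

cl <- makeCluster(4)
# Seed the cluster itself: each worker gets its own RNG stream
clusterSetRNGStream(cl, iseed = 1234)
winners <- parLapply(cl, customer_ids, lucky_draw)
stopCluster(cl)

winners
```

Note that the seed is passed through the iseed argument, and that the call comes after makeCluster() but before the parallel computation.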

7. Multiple runs with same results

We run this code to get IDs of the lucky draw winners. And a second run produces exactly the same results.

8. Multiple runs with same results

We can also program this reproducibility test. We do a first run with a specific seed, and we save the results to a variable called run1. We do a second run with the same seed and save the results to the variable run2. The identical() function from base R compares the value and structure of two R objects. It returns TRUE if they are the same, and FALSE otherwise. A comparison of run1 and run2 shows that they are identical.
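One way to write this test is to wrap the seeded run in a helper function. The helper name run_draw() and the sample data are assumptions made for this sketch; run1, run2, and identical() follow the lesson.

```r
library(parallel)

# Hypothetical data: one vector of customer IDs per country
customer_ids <- list(
  country_a = 1001:1050,
  country_b = 2001:2050,
  country_c = 3001:3050
)

lucky_draw <- function(ids) sample(ids, size = 1)

# Hypothetical helper: one complete lucky draw with a given cluster seed
run_draw <- function(seed) {
  cl <- makeCluster(4)
  on.exit(stopCluster(cl))   # always shut the cluster down
  clusterSetRNGStream(cl, iseed = seed)
  parLapply(cl, customer_ids, lucky_draw)
}

run1 <- run_draw(1234)
run2 <- run_draw(1234)

# TRUE when both runs match in value and structure
identical(run1, run2)
```

Both runs use the same seed and the same number of workers, so the comparison should come back TRUE.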

9. Reproducible results with furrr

How we make our results reproducible depends on the package we use to parallelize. For furrr functions, we supply a seed value to the seed argument of furrr_options(). This creates a configuration that can be reused. We first use future_map() to apply lucky_draw() to every element of customer_ids, passing the configuration through the .options argument. We repeat the calculation for our second run using the same configuration, and we get identical results.
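A sketch of the furrr version is shown below. The multisession plan with four workers, the variable name config, and the sample data are assumptions for illustration; the lesson itself only specifies furrr_options() and future_map().

```r
library(future)
library(furrr)

# Hypothetical data: one vector of customer IDs per country
customer_ids <- list(
  country_a = 1001:1050,
  country_b = 2001:2050,
  country_c = 3001:3050
)

lucky_draw <- function(ids) sample(ids, size = 1)

plan(multisession, workers = 4)   # assumed parallel backend

# Reusable configuration that carries the seed
config <- furrr_options(seed = 1234)

run1 <- future_map(customer_ids, lucky_draw, .options = config)
run2 <- future_map(customer_ids, lucky_draw, .options = config)

identical(run1, run2)

plan(sequential)                  # release the workers
```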

10. Reproducible results with foreach

With foreach, we need to install and load another package, called doRNG. We set up and register our cluster. We supply a seed value to the registerDoRNG() function and generate the results for the first run using the %dopar% operator. With the same seed, we get identical results in a second run.
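The foreach workflow might be sketched as follows. The lesson does not say which backend registers the cluster, so doParallel is an assumption here, as is the sample data. The sample() call is written directly in the loop body rather than through lucky_draw() to keep the sketch self-contained on the workers.

```r
library(doParallel)   # also attaches foreach and parallel
library(doRNG)

# Hypothetical data: one vector of customer IDs per country
customer_ids <- list(
  country_a = 1001:1050,
  country_b = 2001:2050,
  country_c = 3001:3050
)

cl <- makeCluster(4)
registerDoParallel(cl)   # assumed backend

# First run: seed the %dopar% loop through doRNG
registerDoRNG(1234)
run1 <- foreach(ids = customer_ids) %dopar% sample(ids, size = 1)

# Second run: same seed
registerDoRNG(1234)
run2 <- foreach(ids = customer_ids) %dopar% sample(ids, size = 1)

stopCluster(cl)

identical(run1, run2)
```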

11. When to think about reproducibility

Reproducibility must be considered whenever we generate random numbers. This includes all the distribution functions that begin with "r", such as rnorm() and runif(). Random sampling can also cause reproducibility issues; a bootstrap is a classic case. Some functions from other packages use random sampling internally, such as sample_n() from dplyr, which randomly samples rows from a data frame.
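To illustrate the bootstrap case, here is a small sketch of a parallel bootstrap of the mean that stays reproducible. The helper names boot_mean() and bootstrap_means(), the number of resamples, and the use of the built-in mtcars data are all choices made for this example, not part of the lesson.

```r
library(parallel)

# Hypothetical helper: mean of one bootstrap resample (i is the replicate index)
boot_mean <- function(i, values) mean(sample(values, replace = TRUE))

# Hypothetical helper: run n_boot resamples on a seeded cluster
bootstrap_means <- function(seed, n_boot = 200) {
  cl <- makeCluster(4)
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, iseed = seed)
  parSapply(cl, seq_len(n_boot), boot_mean, values = mtcars$mpg)
}

boot1 <- bootstrap_means(1234)
boot2 <- bootstrap_means(1234)

identical(boot1, boot2)
```

Because sample() is called on the workers, the seed has to be set on the cluster, just as in the lucky draw.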

12. Let's practice!

Now let's practice making our code reproducible!
