Using foreach

1. Using foreach

In this lesson, we will get to know foreach and its utility in R programming.

2. A new loop

Let's look at our toy example of square roots. This is the native for loop in R to calculate the square roots of the numbers one to one million. We create an empty vector to collect results. We start the for loop and loop over each number, taking the square root. We store each result in the vector at the position given by the index variable "i". The foreach package in R has a different syntax for this kind of loop. Here we don't need an empty vector to collect results. We start the foreach loop, do some operation on each element using the percent-do-percent operator, and assign the loop's output to our result. The syntax is very intuitive; we could almost read it off in English: for each element "i" in numbers, do square root of "i". Please note that the variable "i" here represents the actual element of numbers, not the index. Cool, but why do we need this?
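The two loops described above can be sketched as follows. This is a minimal illustration, not the exact slide code: it uses a short range of numbers so it runs instantly, and the names result_for, result_foreach, and result_vec are made up for this example.

```r
library(foreach)

numbers <- 1:10  # small range for illustration; the lesson uses one to one million

# Native for loop: create a vector for the results and fill it by index
result_for <- numeric(length(numbers))
for (i in seq_along(numbers)) {
  result_for[i] <- sqrt(numbers[i])
}

# foreach loop: i is the element itself, and the loop returns its results
result_foreach <- foreach(i = numbers) %do% sqrt(i)

# foreach returns a list by default; unlist() flattens it into a vector
result_vec <- unlist(result_foreach)
```

Note that foreach hands back a list unless told otherwise, which is why the last line flattens it before comparing with the for-loop result.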

3. Parallel loops

The real utility of the foreach function is in parallel computation. Let's see how we can run a loop in parallel. Here is the loop we saw in the previous slide. To run this loop in parallel, we will first create a cluster. Now we will register this cluster as the parallel backend for foreach, using the function registerDoParallel() from the doParallel package. This function takes the cluster as an argument. The actual loop remains the same, but in the percent-do-percent operator we replace "do" with "dopar". We stop the cluster once done.
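Put together, the parallel version might look like the sketch below. The cluster size of two workers and the short range of numbers are assumptions chosen to keep the example light; the lesson does not state them.

```r
library(parallel)
library(foreach)
library(doParallel)

numbers <- 1:100  # small range for illustration

# Create a cluster (two workers is an assumed size) and register it
# as the parallel backend for foreach
cl <- makeCluster(2)
registerDoParallel(cl)

# Same loop as before, with %dopar% in place of %do%
result <- foreach(i = numbers) %dopar% sqrt(i)

# Always release the workers when finished
stopCluster(cl)
```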

4. Top engineering universities

Let's see a real example. We have a list of file paths called uni_list. Each CSV file contains university ratings from a country. Let's say we want to read all the data in parallel. We set up a cluster and register it with foreach. With the foreach function we initialize a loop. In the loop body we read each element of uni_list using the read-dot-csv function. And we stop the cluster.
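Here is a rough sketch of that read-in-parallel pattern. Since the actual ratings files are not available here, the example writes two tiny stand-in CSV files to a temporary folder; the file names and the columns university and score are hypothetical, as is the cluster size.

```r
library(foreach)
library(doParallel)

# Hypothetical stand-in data: two small CSV files in a temporary folder
tmp_dir <- tempfile("unis_")
dir.create(tmp_dir)
write.csv(data.frame(university = c("A", "B"), score = c(90, 85)),
          file.path(tmp_dir, "country1.csv"), row.names = FALSE)
write.csv(data.frame(university = c("C", "D", "E"), score = c(88, 70, 95)),
          file.path(tmp_dir, "country2.csv"), row.names = FALSE)

# The file paths to loop over
uni_list <- list.files(tmp_dir, full.names = TRUE)

# Set up the cluster and register it with foreach
cl <- makeCluster(2)
registerDoParallel(cl)

# Each worker reads one file; the loop returns one data frame per file
uni_data <- foreach(f = uni_list) %dopar% read.csv(f)

stopCluster(cl)
```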

5. Collecting results with foreach

The output of this parallel loop is a list of data frames. The foreach() function has helpful arguments to modify the behavior of the loop. Take, for instance, the dot-combine argument. This argument takes a function, or the name of a function, and uses it to combine the results of the individual iterations. So if we supply "rbind" to this argument, all our data frames are row-bound into a single data frame.
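To see what dot-combine does in isolation, here is a small sequential sketch with made-up data. It uses percent-do-percent so no cluster is needed; the column names id and square are only for illustration.

```r
library(foreach)

# Default: one list element per iteration
as_list <- foreach(i = 1:3) %do% {
  data.frame(id = i, square = i^2)
}

# With .combine = rbind: the per-iteration data frames are stacked by row
as_df <- foreach(i = 1:3, .combine = rbind) %do% {
  data.frame(id = i, square = i^2)
}

# With .combine = c: the per-iteration values are concatenated into a vector
as_vec <- foreach(i = 1:3, .combine = c) %do% {
  i^2
}
```

The same argument works unchanged with percent-dopar-percent, so the list of university data frames could be collapsed in the loop itself.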

6. Read, filter, and combine

Now let's say we want to load the data from our list of CSV file paths into a single data frame, filtering for the universities with the three highest scores in a given country. Our native for loop could be something like this. We load dplyr and specify the number of top universities to select. We create an empty list and loop over our file paths, reading and filtering with dplyr's top_n(). And we can then combine the results into one data frame.
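A sketch of that native for loop is below, again on hypothetical stand-in files with assumed columns university and score. The name top_list is made up for the example. As a side note, top_n() still works but dplyr marks it as superseded by slice_max(); it is kept here because it is what the lesson uses.

```r
library(dplyr)

# Hypothetical stand-in data: two small CSV files with distinct scores
tmp_dir <- tempfile("unis_")
dir.create(tmp_dir)
write.csv(data.frame(university = c("A", "B", "C", "D"),
                     score = c(90, 85, 70, 60)),
          file.path(tmp_dir, "country1.csv"), row.names = FALSE)
write.csv(data.frame(university = c("E", "F", "G", "H"),
                     score = c(88, 72, 95, 50)),
          file.path(tmp_dir, "country2.csv"), row.names = FALSE)
uni_list <- list.files(tmp_dir, full.names = TRUE)

n_unis <- 3  # number of top universities to keep per country

# Read and filter each file, collecting the pieces in a list
top_list <- list()
for (i in seq_along(uni_list)) {
  top_list[[i]] <- read.csv(uni_list[i]) %>% top_n(n_unis, score)
}

# Combine the per-country pieces into one data frame
top_unis <- do.call(rbind, top_list)
```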

7. foreach for the win

foreach() allows us to do this very neatly in parallel, but requires some housekeeping. We set up and register our cluster. Here is the foreach() parallel loop. The dot-combine argument can do the "rbind"-combining for us. However, this being a PSOCK cluster, we will need dplyr loaded in our parallel processes. So we supply "dplyr" to the dot-packages argument. n_unis will need to be exported to the parallel processes too. We do this by supplying "n_unis" to the dot-export argument. And we get our data loaded and filtered into one data frame!
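The housekeeping described above could look roughly like this. The stand-in files, the columns university and score, and the two-worker cluster are all assumptions for the sake of a runnable example. One hedge on dot-export: foreach usually picks up variables referenced in the loop body on its own, so naming n_unis explicitly may be redundant and may produce an "already exporting" warning; it is included here to mirror the lesson.

```r
library(foreach)
library(doParallel)

# Hypothetical stand-in data: two small CSV files with distinct scores
tmp_dir <- tempfile("unis_")
dir.create(tmp_dir)
write.csv(data.frame(university = c("A", "B", "C", "D"),
                     score = c(90, 85, 70, 60)),
          file.path(tmp_dir, "country1.csv"), row.names = FALSE)
write.csv(data.frame(university = c("E", "F", "G", "H"),
                     score = c(88, 72, 95, 50)),
          file.path(tmp_dir, "country2.csv"), row.names = FALSE)
uni_list <- list.files(tmp_dir, full.names = TRUE)

n_unis <- 3  # number of top universities to keep per country

# Set up and register the cluster (PSOCK is the default cluster type)
cl <- makeCluster(2)
registerDoParallel(cl)

top_unis <- foreach(f = uni_list,
                    .combine = rbind,       # row-bind the per-file results
                    .packages = "dplyr",    # load dplyr on each worker
                    .export = "n_unis") %dopar% {  # send n_unis to the workers
  read.csv(f) %>% top_n(n_unis, score)
}

stopCluster(cl)
```

PSOCK workers start as fresh R sessions, which is why dplyr has to be loaded on them through dot-packages rather than only in the main session.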

8. Let's practice!

Awesome! Now let's practice some parallel loops.
