1. Advanced foreach operations
In this lesson we will look at some advanced operations with foreach.
2. The case of the enormous CSV
Suppose we have a very large CSV file of price data for all stocks, and we only want to analyze a few tech company stocks.
Loading the file fails with an error, and the error message tells us that we are running out of memory.
Even if we do manage to load it, it will be cumbersome and wasteful to deal with all the data. How do we manage this task efficiently?
3. Iterators are it
Iterators can help us process the data in small, manageable chunks. The iterators package has many functions for iterating over vectors, lists, and other objects, but we will focus on ireadLines().
This function takes the path to a text file, such as our CSV, and generates an iterator object. The iterator reads one line at a time rather than loading the whole file into memory.
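As a rough illustration of how such an iterator behaves, here is a minimal sketch. The file contents, the column names, and the variable names are made up for this example and are not the dataset from the lesson. We open the connection ourselves so that we can close it when we are done.

```r
library(iterators)

# Hypothetical sample data, written to a temporary file so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("date,company,close",
             "2020-01-02,Tesla,86.05",
             "2020-01-02,Apple,75.09",
             "2020-01-03,Tesla,88.60"),
           csv_path)

# Open the file and wrap the connection in an iterator; nothing is read yet
con <- file(csv_path, open = "r")
line_iter <- ireadLines(con, n = 1)

# Each call to nextElem() reads exactly one more line from the file
first_line  <- nextElem(line_iter)  # the header line
second_line <- nextElem(line_iter)  # the first data row

close(con)
```

Only the lines we ask for are ever held in memory, which is the property we need for a file that is too large to load in one go.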
4. Iterators with foreach
The coolest thing is that we can supply this iterator to foreach!
Within the loop body we use the grepl() function to keep only the lines that contain the word "Tesla". Notice that we can use a return() statement in the body of a foreach loop, unlike in a base R for loop.
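A sequential version of this filter might look like the sketch below. The sample file is hypothetical, and the choice of "c" for the .combine argument is an assumption of this example, made so that lines without a match (which yield NULL) simply drop out of the result.

```r
library(foreach)
library(iterators)

# Hypothetical sample data standing in for the large CSV
csv_path <- tempfile(fileext = ".csv")
writeLines(c("date,company,close",
             "2020-01-02,Tesla,86.05",
             "2020-01-02,Apple,75.09",
             "2020-01-03,Tesla,88.60"),
           csv_path)

# The iterator feeds one line per iteration to the loop body
tesla_lines <- foreach(line = ireadLines(csv_path, n = 1), .combine = "c") %do% {
  # return() hands back the value of this iteration; non-matching lines give NULL
  if (grepl("Tesla", line)) return(line)
}

print(tesla_lines)
```

With the default combine behaviour we would instead get a list that still contains a NULL entry for every line that did not match.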
5. %dopar% the CSV
And we can do this in parallel!
We do the usual cluster setup.
We supply the ireadLines() iterator to foreach() and specify the "rbind" combine method.
We select lines containing the word "Tesla". Each line is read as a single string, so we split it into columns using strsplit() and remove extra characters using gsub().
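Put together, the parallel version could be sketched as follows. The sample data, the two-worker cluster, and the assumption that the "extra characters" are double quotes around each field are all illustrative choices for this example rather than details taken from the lesson.

```r
library(foreach)
library(iterators)
library(doParallel)

# Hypothetical sample data with quoted fields
csv_path <- tempfile(fileext = ".csv")
writeLines(c('"date","company","close"',
             '"2020-01-02","Tesla",86.05',
             '"2020-01-02","Apple",75.09',
             '"2020-01-03","Tesla",88.60'),
           csv_path)

# Usual cluster setup (two workers is an arbitrary choice here)
cl <- makeCluster(2)
registerDoParallel(cl)

tesla <- foreach(line = ireadLines(csv_path, n = 1), .combine = "rbind") %dopar% {
  if (grepl("Tesla", line)) {
    # Split the single string into its comma-separated fields
    fields <- strsplit(line, ",")[[1]]
    # Strip the double quotes that surround some fields
    gsub('"', "", fields, fixed = TRUE)
  }
}

stopCluster(cl)

print(tesla)
```

The result is a character matrix with one row per matching line. Splitting on every comma is only safe when no field contains a comma itself, and the values would still need converting to dates and numbers before analysis.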
When we run this, we get the data loaded and filtered!
6. Stock prices
We apply this process to all the stocks we want to analyze and we have our data ready.
7. Three-day moving average
Now we want to calculate a three-day moving average. This means we want to calculate an average stock price for every three-day period in our dataset: an average from day one to day three, then from day two to day four, and so on.
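To make the arithmetic concrete, here is a small worked example in plain R with five made-up prices. Five days give three overlapping three-day windows.

```r
# Made-up closing prices for five consecutive days
prices <- c(10, 12, 14, 16, 18)

# With n days there are n - 2 three-day windows
window_starts <- 1:(length(prices) - 2)

# Average each window: days 1 to 3, days 2 to 4, days 3 to 5
moving_avg <- sapply(window_starts, function(i) mean(prices[i:(i + 2)]))

print(moving_avg)  # 12 14 16
```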
8. Three-day moving average for Tesla stock
We focus on Tesla stock prices for now. Of course, we want to do this in parallel.
We make and register the cluster, then set up the foreach loop. Notice that this time we iterate over row indices, because each iteration needs to access multiple rows. After the "%dopar%" operator for parallel processing, we take the mean of three rows of the Tesla column, from the current row to the current row plus two. For the window to stay inside the data, the index has to stop two rows before the last one.
Since we didn't specify anything for the .combine argument, we get a simple list of moving averages.
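The loop described here might be sketched like this. The data frame, its column names, and its values are invented for the example, and the two-worker cluster is an arbitrary choice.

```r
library(foreach)
library(doParallel)

# Made-up price data; the real dataset would come from the filtered CSV
prices <- data.frame(Tesla = c(10, 12, 14, 16, 18),
                     Apple = c(15, 20, 25, 30, 35))

cl <- makeCluster(2)
registerDoParallel(cl)

n <- nrow(prices)

# Iterate over row indices, stopping two rows early so every window is complete
tesla_ma <- foreach(i = 1:(n - 2)) %dopar% {
  mean(prices$Tesla[i:(i + 2)])
}

stopCluster(cl)

str(tesla_ma)
```

Because .combine is left at its default, tesla_ma is a list with one element per window. Passing .combine = "c" would return a numeric vector instead.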
9. Moving average for all columns
So we have the code to calculate moving averages for one column of this dataset. How can we do this for every column of this dataset? An intuitive way would be to iterate over columns, and within each iteration for a given column, loop over the rows.
10. Nested loops
This would result in a nested loop.
We first extract the dimensions of the dataset. We do the usual housekeeping and set up the outer loop over columns. Since its results will be the columns of our output, we specify "cbind" for the .combine argument. But instead of "%dopar%", we follow this call with the "%:%" nesting operator. This is the way to nest multiple calls of foreach().
In the inner loop, we iterate over rows. The results of this loop are individual averages, so we combine them into a vector with "c", the concatenate function. We then follow this with "%dopar%" and write the code for each iteration.
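A minimal sketch of the nested loop, assuming a small made-up data frame in place of the real price data and a two-worker cluster:

```r
library(foreach)
library(doParallel)

# Made-up price data with one column per stock
prices <- data.frame(Tesla = c(10, 12, 14, 16, 18),
                     Apple = c(15, 20, 25, 30, 35))

# Extract the dimensions of the dataset
n_rows <- nrow(prices)
n_cols <- ncol(prices)

cl <- makeCluster(2)
registerDoParallel(cl)

# Outer loop over columns, joined to the inner loop over rows with %:%
moving_avgs <- foreach(j = 1:n_cols, .combine = "cbind") %:%
  foreach(i = 1:(n_rows - 2), .combine = "c") %dopar% {
    # Mean of three consecutive rows in column j
    mean(prices[i:(i + 2), j])
  }

stopCluster(cl)

print(moving_avgs)
```

The inner loop produces one numeric vector per column, and the outer "cbind" places those vectors side by side, so the result is a matrix with one row per window and one column per stock. The column names of the original data are not carried over, so they would have to be reassigned afterwards if needed.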
11. Moving averages
When we run this, we get our moving averages.
12. Let's practice!
Let's try out this functionality in the exercises!