1. Monitoring and managing memory
Hi! In this lesson we will discuss memory management.
2. The queue and the space
Imagine a queue at a bank. Three tellers attend to three people at a time. For this queue to function properly, the bank must be large enough for three tellers to work side by side.
You can't put 20 tellers in a small bank. There's simply not enough space for them to work efficiently.
3. The parallel flow
Similarly, in a parallel workflow, a long series of tasks is split into subtasks, each subtask is passed to a core for execution, and the results are combined.
4. The parallel flow
All this usually happens in the random access memory, or RAM. If we overload the RAM, the R session could crash.
5. The births data
Suppose we have a long list of CSV file paths. Each file contains birth data for one US state. We'd like to load these files in parallel.
6. Mapping with futures
We'll use the furrr package to do the job.
We plan a multisession with two workers, then use future_map() to apply the read-dot-csv function to every element of the file list. Finally, we revert to a sequential plan.
This loads the CSVs into a list of data frames.
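In code, this pattern might look like the following sketch, where csv_files is a hypothetical character vector of the file paths:

library(future)
library(furrr)

# Hypothetical paths: one births CSV per state, plus DC
csv_files <- list.files("births", pattern = "\\.csv$", full.names = TRUE)

# Plan a multisession with two background workers
plan(multisession, workers = 2)

# Apply read.csv() to every file path in parallel
births_list <- future_map(csv_files, read.csv)

# Revert to a sequential plan
plan(sequential)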
7. Profiling with two workers
We profile this code using the profvis() function.
Looking at the Memory column in the Flame Graph output, we see that we use one-point-six megabytes of RAM.
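For reference, the profiling call could be a sketch like this, wrapping the mapping code and reusing the hypothetical csv_files vector from the earlier sketch:

library(profvis)
library(future)
library(furrr)

# Profile the two-worker mapping to inspect time and memory usage
profvis({
  plan(multisession, workers = 2)
  births_list <- future_map(csv_files, read.csv)
  plan(sequential)
})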
8. Profiling with four workers
With four workers, point-three megabytes goes to setting up the multisession, while the mapping takes three-point-one megabytes. Overall, we more than double the memory usage.
This is because, now, four workers are loading CSVs at the same time. More cores or workers in parallel require more memory. While the memory usage here is pretty small, with larger CSVs we could run out of RAM.
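Only the number of workers changes for this run; continuing from the previous sketch:

# Same mapping, profiled with four background workers
profvis({
  plan(multisession, workers = 4)
  births_list <- future_map(csv_files, read.csv)
  plan(sequential)
})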
9. Behind the scenes
Internally, the list of file paths is divided into four roughly equal-sized chunks. This is called chunking. For illustration, we assume the chunking follows the traditional US geographical regions.
Elements from each chunk are passed to a worker process. So, at any given time, there are four CSV files being loaded.
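To get a feel for what roughly equal chunks look like, we can try parallel::splitIndices(); this only illustrates the idea and is not necessarily the exact scheme furrr uses:

library(parallel)

# Split 51 indices (50 states plus DC) into four roughly equal chunks
splitIndices(51, 4)
# Returns a list of four integer vectors of about 12 to 13 indices each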
10. Managing memory by chunking
We can control the memory usage by manipulating chunk size.
We can supply the desired chunk size to the chunk_size argument of furrr_options().
We set the chunk size to 26. So we are dividing our list of 50 states plus Washington DC, 51 files in all, into two chunks, one of 26 elements and the other of 25.
We plan our multisession of four workers, and we supply the configuration we created to the dot-options argument of future_map().
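A sketch of this configuration, again using the hypothetical csv_files vector:

library(future)
library(furrr)

# Pass 26 elements to each worker at a time
opts <- furrr_options(chunk_size = 26)

plan(multisession, workers = 4)

# Supply the configuration via the .options argument
births_list <- future_map(csv_files, read.csv, .options = opts)

plan(sequential)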
11. Managing memory by chunking
When we profile this code, we can see that we have reduced the memory usage by 25 percent.
It's important to remember that profiling outputs can vary, and tweaking the chunk size will have different effects depending on the context.
12. Chunking with parallel
We can try a similar approach with parLapply().
We make a cluster of four cores. In a parLapply() call, we use the cluster to apply read-dot-csv() to each element of the file list. Once done, we stop the cluster.
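A sketch with the parallel package, assuming the same csv_files vector:

library(parallel)

# Create a cluster of four worker processes
cl <- makeCluster(4)

# Apply read.csv() to each file path across the cluster
births_list <- parLapply(cl, csv_files, read.csv)

# Shut down the worker processes
stopCluster(cl)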
If we profile this code, we see that we used two-point-four megabytes of RAM. This is roughly the same as the usage for a multisession of four with future_map().
13. Chunking with parallel
To change the chunk size, we can supply a value to the chunk-dot-size argument of parLapply(). Here we supply 26.
This reduces our memory usage to one megabyte.
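Continuing from the previous sketch, the same call with an explicit chunk size:

cl <- makeCluster(4)

# Hand each worker 26 file paths at a time
births_list <- parLapply(cl, csv_files, read.csv, chunk.size = 26)

stopCluster(cl)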
14. When to chunk?
Usually, chunks are handled efficiently. In situations where memory is causing crashes, we could reduce the number of cores or workers, although that comes with a speed trade-off. It is good practice to experiment with a few chunk-size values to find the optimum.
15. Let's practice!
Let's practice this functionality in the exercises.