1. What is parallel computing
Hi again! Now that you've learned everything about databases, let's talk about parallel computing. In data engineering, you often have to pull in data from several sources, join it together, clean it, or aggregate it. In this video, we'll see how this is possible for massive amounts of data.
2. Idea behind parallel computing
Before we go into the different kinds of tools that exist in the data engineering ecosystem, it's crucial to understand the concept of parallel computing. Parallel computing forms the basis of almost all modern data processing tools. However, why has it become so important in the world of big data? The main reason is memory and processing power, but mostly memory.
When big data processing tools perform a processing task, they split it up into several smaller subtasks. The processing tools then distribute these subtasks over several computers.
These are usually commodity computers, which means they are widely available and relatively inexpensive. Individually, each of these computers would take a long time to process the complete task. However, since all the computers work in parallel on smaller subtasks, the task as a whole is done faster.
3. The tailor shop
Let's look at an analogy. Let's say you're running a tailor shop and need to get a batch of 100 shirts finished. Your very best tailor finishes a shirt in 20 minutes. Other tailors typically take 1 hour per shirt. If just one tailor can work at a time, it's obvious you'd have to choose the quickest tailor to finish the job.
However, if you can split the batch into four groups of 25 shirts each, having 4 mediocre tailors working in parallel is faster, as the quick calculation below shows. A similar thing happens for big data processing tasks.
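To check the numbers, here is a quick back-of-the-envelope calculation using only the figures from the analogy above:

```python
# Fastest tailor working alone: 100 shirts at 20 minutes each
fastest_alone_hours = 100 * 20 / 60      # about 33.3 hours

# Four mediocre tailors in parallel: 25 shirts each at 1 hour per shirt
four_in_parallel_hours = 25 * 1          # 25 hours

print(fastest_alone_hours, four_in_parallel_hours)
```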
4. Benefits of parallel computing
As you'd expect, the obvious benefit of having multiple processing units is the extra processing power itself. However, there is another, potentially more impactful benefit of parallel computing for big data. Instead of needing to load all of the data into one computer's memory, you can partition the data and load the subsets into the memory of different computers. That means the memory footprint per computer is relatively small, and the data can fit in the memory closest to the processor: the RAM.
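As a rough sketch of what partitioning looks like, assuming a hypothetical `athlete_events` DataFrame loaded from a CSV file (the names are placeholders for illustration, not course code):

```python
import pandas as pd

# Hypothetical dataset; in practice this could be any large table
athlete_events = pd.read_csv("athlete_events.csv")

n_partitions = 4
partition_size = -(-len(athlete_events) // n_partitions)  # ceiling division

# Slice the rows into 4 roughly equal partitions; each one could then
# be loaded into the memory of a different machine
partitions = [
    athlete_events.iloc[i * partition_size : (i + 1) * partition_size]
    for i in range(n_partitions)
]

for i, part in enumerate(partitions):
    print(f"partition {i}: {len(part)} rows")
```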
5. Risks of parallel computing
Before you start rewriting all your code to use parallel computing, keep in mind that it also comes at a cost. Splitting a task into subtasks and merging the results of the subtasks back into one final result requires communication between processes. This communication overhead can become a bottleneck if the processing requirements are not substantial, or if you have too few processing units. In other words, if you have 2 processing units, a task that takes a few hundred milliseconds might not be worth splitting up. Additionally, due to the overhead, the speed-up does not increase linearly with the number of processing units. This effect is also called parallel slowdown.
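Here is a minimal sketch of that overhead, using a deliberately trivial task (squaring numbers) where the cost of starting processes and shipping data back and forth dominates the actual work; the function and numbers are made up for illustration:

```python
import time
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    numbers = list(range(100_000))

    # Sequential: the work per item is tiny, so this finishes quickly
    start = time.time()
    sequential = [square(n) for n in numbers]
    print("sequential:", time.time() - start)

    # Parallel with 2 processes: process startup and communication
    # typically cost more time than the computation saves
    start = time.time()
    with Pool(2) as p:
        parallel = p.map(square, numbers)
    print("parallel:  ", time.time() - start)
```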
6. An example
Let's look at a more practical example. We're starting with a dataset of all Olympic events from 1896 until 2016. From this dataset, you want to get the average age of participants for each year. For this example, let's say you have four processing units at your disposal. You decide to distribute the load over all of your processing units. To do so, you need to split the task into smaller subtasks. In this example, the average age calculation for each group of years is a subtask. You can achieve that through a groupby operation. Then, you distribute all of these subtasks over the four processing units. This example roughly illustrates how the first distributed algorithms like Hadoop MapReduce work, the difference being that the processing units are distributed over several machines.
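Before parallelizing anything, the sequential version of this computation fits in a few lines of pandas. The file name and the `Year` and `Age` column names here are assumptions for illustration:

```python
import pandas as pd

# Assumed input: one row per athlete appearance, with "Year" and "Age" columns
athlete_events = pd.read_csv("athlete_events.csv")

# Sequential baseline: one processing unit computes every group's mean
mean_age_by_year = athlete_events.groupby("Year")["Age"].mean()
print(mean_age_by_year.head())
```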
7. multiprocessing.Pool
In code, there are several ways of implementing this. At a low level, we could use the `multiprocessing.Pool` API to distribute work over several cores on the same machine. Let's say we have a function `take_mean_age`, which accepts a tuple: the year of the group and the group itself as a DataFrame. `take_mean_age` returns a DataFrame with one observation and one column: the mean age of the group. The resulting DataFrame is indexed by year. We can then take this function, and map it over the groups generated by `.groupby()`, using the `.map()` method of `Pool`. By defining `4` as an argument to `Pool`, the mapping runs in 4 separate processes, and thus uses 4 cores. Finally, we can concatenate the results to form the resulting DataFrame.
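A sketch of what that could look like; the `athlete_events` DataFrame, its file name, and the `Year` and `Age` column names are assumptions, so treat this as an illustration of the pattern rather than the exact course code:

```python
from multiprocessing import Pool
import pandas as pd

def take_mean_age(year_and_group):
    """Return a one-row DataFrame with the group's mean age, indexed by year."""
    year, group = year_and_group
    return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])

if __name__ == "__main__":
    # Assumed input: one row per athlete appearance, with "Year" and "Age" columns
    athlete_events = pd.read_csv("athlete_events.csv")

    # Map take_mean_age over the (year, group) pairs using 4 processes
    with Pool(4) as p:
        results = p.map(take_mean_age, athlete_events.groupby("Year"))

    # Concatenate the per-year results into the final DataFrame
    result_df = pd.concat(results)
    print(result_df.head())
```

The `if __name__ == "__main__":` guard matters here: on platforms that spawn new worker processes, each worker re-imports the module, and the guard prevents the pool from being created again inside every worker.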
8. dask
Several packages offer a layer of abstraction to avoid having to write such low-level code. For example, the `dask` framework offers a DataFrame object, which performs a groupby and apply using multiprocessing out of the box. You need to define the number of partitions, for example, `4`. `dask` divides the DataFrame into 4 partitions, performs the computation within each partition separately, and combines the partial results. Because `dask` uses lazy evaluation, you need to add `.compute()` to the end of the chain.
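A minimal `dask` sketch of the same computation, again assuming the hypothetical `athlete_events` DataFrame with `Year` and `Age` columns:

```python
import pandas as pd
import dask.dataframe as dd

# Assumed pandas DataFrame with "Year" and "Age" columns
athlete_events = pd.read_csv("athlete_events.csv")

# Wrap the pandas DataFrame in a dask DataFrame with 4 partitions
athlete_events_dask = dd.from_pandas(athlete_events, npartitions=4)

# Build the computation lazily, then trigger it with .compute()
result_df = athlete_events_dask.groupby("Year")["Age"].mean().compute()
print(result_df.head())
```

Note that until `.compute()` is called, `dask` only builds up a task graph; no data is actually processed.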
9. Let's practice!
That was the final example of this video. In the exercises, you'll use the packages yourself. Good luck!