1. Scaling batch processing
Welcome back! Now that we've discussed the ideas behind batch processing, let's look at some common techniques to scale these tools.
2. What is scaling?
Before discussing specific scaling techniques, let's take a moment and mention what we actually mean by scaling.
Scaling is the idea of improving the performance of our processing.
In practice, this means we can process the same amount of data in less time than before.
A by-product is that we can also process more data in the same amount of time.
Let's take a look at a couple of methods for scaling our processing now.
3. Vertical scaling
The first kind of scaling we'll look at is vertical scaling.
Vertical scaling is running our processing on a better computing platform.
This can include a better or faster CPU (for example 2.5GHz vs 1.8GHz).
Faster IO is a common method for vertical scaling - such as solid state disks vs spinning hard drives, or faster networking components.
A vertically scaled system could also include more or faster memory (RAM) than the previous system.
Typically, vertical scaling is the easiest kind of scaling - it is one of the least complex ways to improve performance.
It also rarely requires changing the underlying programs or algorithms as the system simply runs more quickly.
4. Vertical scaling cons
There are some cons to relying on vertical scaling for any performance improvement.
Vertical scaling is inherently limited - there is always a fastest available CPU and a maximum amount of memory a given system can hold. If you're already using the fastest CPU available, you simply don't have the option to use a better one.
Vertical scaling can also be expensive or have a low return on investment. There is often a price premium for the fastest CPU or the most cores. This premium is usually not linear on a price/performance curve and should be considered.
Finally, it should be noted that industry improvements are not guaranteed. If your data grows by 15 percent per year, there most likely won't be a single system available next year that is 15 percent faster.
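To make this concrete, here's a tiny sketch comparing 15 percent yearly data growth against a hypothetical 10 percent yearly single-system speedup - the 10 percent figure is an assumption chosen purely for illustration:

```python
# Assumed rates for illustration only - real hardware gains vary.
data_growth = 1.15    # data grows 15% per year (from the example above)
hardware_gain = 1.10  # hypothetical yearly single-system speedup

years = 5
relative_data = data_growth ** years    # ~2.01x as much data after 5 years
relative_speed = hardware_gain ** years # ~1.61x faster hardware after 5 years

# The gap widens every year: vertical scaling alone falls behind.
print(relative_data > relative_speed)
```

Under these assumed rates, data volume roughly doubles in five years while a single system only gets about 60 percent faster, which is why vertical scaling alone often can't keep up with growth.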
5. Horizontal scaling
Now let's take a look at another option, horizontal scaling.
This is the idea of splitting a task among multiple systems, such as multiple computers. If you're already running a task on, say, four computers, you could scale the task to eight computers. Note that it could also mean adding cores or CPUs to a single system.
Horizontal scaling is best run on tasks that are embarrassingly parallel, meaning tasks that can be easily divided among workers. Consider if we needed to count the number of words on each page of a book. Every page could be made into its own task and sent to a different worker to complete the overall job.
Horizontal scaling can be very cost-effective and can deliver near-linear performance improvements if your tasks can be split appropriately.
6. Horizontal scaling cons
There are definitely some cons to using horizontal scaling, mainly its complexity.
It typically requires some kind of processing framework like Apache Spark or Dask to properly utilize the machines.
It also needs much more extensive connectivity between the systems so they can communicate.
There are also ongoing management costs, and it can get expensive depending on what you're trying to accomplish.
Finally note that some tasks can't easily be scaled horizontally. Consider a process where each step requires asking a remote server for the next instruction that must be performed in order.
7. Let's practice!
We've discussed some of the ideas around scaling batch processing - let's practice what you've learned in the exercises ahead!