1. Batch issues
Welcome back! Now that we know how to scale batch processes, let's look at some of the common problems that can arise when implementing this solution.
2. Delays
The most common issue in batch processing is the amount of time spent in any portion of it.
There are four primary types of delays we see with batching:
The first is the amount of time until the data is ready to start processing. In the case of batching, this means making sure that all data required for the batch is available. The longer this takes, the longer we wait before processing can start.
The next delay type is the time until the processing begins. With a scheduled process, data may wait up to one full scheduled interval before it is processed, for example if it arrives just after the previous run has started.
Next we have the amount of time required to process the data. Refer to our previous lesson on scaling to understand a bit more about what this means.
Finally, there may be a delay before the data is transferred or made available to end users. There are assorted reasons for this, but mostly it's the amount of time required to copy the data into the final data store.
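To make the second delay type concrete, here is a minimal sketch of how long newly arrived data waits for the next scheduled run. The function name `wait_until_next_run` and the fixed-interval schedule are assumptions for illustration, not part of any particular scheduler.

```python
from datetime import datetime, timedelta


def wait_until_next_run(arrival: datetime,
                        first_run: datetime,
                        interval: timedelta) -> timedelta:
    """Return how long data arriving at `arrival` waits for the next run.

    Assumes (hypothetically) that runs start at `first_run` and then
    repeat every `interval` with no drift.
    """
    if arrival <= first_run:
        return first_run - arrival
    # Time elapsed since the most recent run started.
    since_last_run = (arrival - first_run) % interval
    if since_last_run == timedelta(0):
        return timedelta(0)  # arrived exactly as a run starts
    return interval - since_last_run


# Data arriving five minutes after a daily run waits almost a full day.
daily = timedelta(hours=24)
print(wait_until_next_run(datetime(2024, 1, 1, 0, 5),
                          datetime(2024, 1, 1, 0, 0),
                          daily))
```

The worst case approaches one full interval, which is why shortening the schedule is often the first lever for reducing this kind of delay.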
3. Example #1
Let's take a look at a couple of examples of delays.
In this first scenario we must wait on the source data to be ready.
For this case, consider that a group of machines must copy their log files to a central source for processing. Each machine is configured to send the information when the system resources are at their lowest utilization.
This method works well during normal utilization, but if the resources are busy, the processing system may need to wait a considerable time for enough data to be available.
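As a rough sketch of this scenario, assume the batch can only start once every machine has delivered its log file. The machine names and timestamps below are made up for illustration.

```python
from datetime import datetime


def batch_ready_time(arrivals: dict[str, datetime]) -> datetime:
    """The batch is ready only when the slowest machine has delivered."""
    return max(arrivals.values())


# Hypothetical delivery times: two machines were idle early,
# one stayed busy and sent its logs much later.
arrivals = {
    "web-01": datetime(2024, 1, 1, 1, 10),
    "web-02": datetime(2024, 1, 1, 1, 25),
    "web-03": datetime(2024, 1, 1, 9, 40),
}
print(batch_ready_time(arrivals))
```

A single busy machine sets the start time for the whole batch, no matter how early the others finished.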
4. Example #2
The next example represents the typical kind of problem with batch processing - the amount of time it takes to actually process the data.
Sticking to our log processing example, let's consider parsing 100 gigabytes of data per day.
Currently, it takes approximately 23 hours to process that data,
or about 4.3 gigabytes per hour.
But given a successful service offering, our data will grow. In our case, it's growing at about 5% per month.
The next month would be about 105 gigabytes of data and take about 24 hours to complete.
The following month would be about 110 gigabytes and take about 25 hours. This means that it would take longer than one day to process a day's worth of data!
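The arithmetic in this example can be sketched as a short projection. It assumes throughput stays fixed at today's rate and that the 5% growth compounds monthly; the function name `project_processing_hours` is hypothetical.

```python
def project_processing_hours(daily_gb: float,
                             hours_now: float,
                             monthly_growth: float,
                             months: int) -> list[tuple[int, float, float]]:
    """Project daily volume and processing time for each future month.

    Assumes constant throughput (daily_gb / hours_now) and compound growth.
    Returns (month, gigabytes per day, hours to process) tuples.
    """
    throughput = daily_gb / hours_now  # gigabytes per hour
    rows = []
    for month in range(months + 1):
        volume = daily_gb * (1 + monthly_growth) ** month
        rows.append((month, volume, volume / throughput))
    return rows


for month, volume, hours in project_processing_hours(100, 23, 0.05, 3):
    print(f"month {month}: {volume:.1f} GB/day, {hours:.1f} hours")
```

With these inputs, month 1 comes out at roughly 24.2 hours and month 2 at roughly 25.4 hours, so the one-hour margin we have today disappears almost immediately.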
5. Example #3
The last example we'll consider in this lesson is the time it takes to update an analytics system once our data is processed.
Sticking with the log processing examples, let's say we need a daily report for sales with the number of rows processed, the number of unique systems, etc.
Generating this report requires all of the data to be present on the analytics system. We must wait for each step to complete before the process can finish and the report can be generated. The sum of these times is the minimum amount of time needed to produce a new report.
This includes the amount of time to generate the data, time to process the data, time to update the systems, and finally the time to generate the report.
While generating this report only takes minutes to complete, it can only show data from 1.5 days ago.
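Here is a minimal sketch of how the stage times add up. The individual durations are invented for illustration, chosen only so that they total 1.5 days; the lesson does not give the actual breakdown.

```python
from datetime import timedelta

# Hypothetical stage durations (assumptions, not measured values).
stage_durations = {
    "generate data": timedelta(hours=6),
    "process data": timedelta(hours=23),
    "update systems": timedelta(hours=6, minutes=30),
    "generate report": timedelta(minutes=30),
}


def minimum_report_lag(stages: dict[str, timedelta]) -> timedelta:
    """The report can be no fresher than the sum of all sequential stages."""
    return sum(stages.values(), timedelta())


lag = minimum_report_lag(stage_durations)
print(f"minimum lag: {lag / timedelta(days=1)} days")
```

Because the stages run one after another, speeding up the final report step alone barely changes how fresh the data is.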
6. Let's practice!
You can start to see how issues can arise when handling batch processing in certain scenarios. Let's review some of those issues in the exercises ahead. Then, we'll see you back in the next chapter to discover better ways to handle these scenarios!