
Batching vs. streaming

1. Batching vs. streaming

Welcome back! Now that we've discussed some basics about streaming, let's take a moment to compare batching (from chapter 1) with streaming.

2. Quick review

Before comparing them, let's remind ourselves how batch, queue, and stream processes work. Batch processes handle data in groups, or batches. The most important details for batch processing are the batch size (the number of items processed at a time) and the batch frequency (how often items are processed). Queues are a method to store and process data in the same order as it was received. One really cool thing to note is that queues behave just like batches with a batch size of one (and in order!). This means that many of our batch processing tools work with queues as well! Streaming systems vary a lot based on implementation, but they have two primary attributes: they handle data without pausing along the way, and they don't have a known ending event. You may have also noticed that because streams process data as it arrives without pausing, they generally maintain processing order as well, just like queues!
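To make the review concrete, here is a minimal Python sketch of the three processing styles. The names `batches`, `process`, and `sensor_stream` are hypothetical helpers made up for illustration, and doubling each value stands in for whatever real work a pipeline would do.

```python
from collections import deque
from itertools import count, islice


def batches(items, batch_size):
    """Yield successive lists of at most batch_size items, preserving order."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process(batch):
    """Stand-in for real work: double every item in the batch."""
    return [x * 2 for x in batch]


data = [1, 2, 3, 4, 5]

# Batch: items are handled in groups; batch_size controls the group size.
batch_results = [process(b) for b in batches(data, batch_size=2)]

# Queue: first in, first out. Each item is effectively a batch of size one.
queue = deque(data)
queue_results = []
while queue:
    item = queue.popleft()
    queue_results.append(process([item]))

# Stream: the source has no known end, so the consumer decides when to stop.
def sensor_stream():
    """Hypothetical endless source of readings: 1, 2, 3, ..."""
    for reading in count(1):
        yield reading


stream_results = []
for reading in sensor_stream():
    stream_results.append(process([reading])[0])
    if len(stream_results) == 3:  # stop only so the example terminates
        break
```

Note that the queue loop reuses `process` unchanged, which is the sense in which batch tools also work with queues. The stream loop only ends because the example breaks out of it; the source itself never signals an end.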

3. Fire!

Let's consider an analogy to better describe batching vs. streaming: putting out fires. Before the modern era, fire hydrants and fire trucks were uncommon or even non-existent. To put out fires, groups of people would form a line (or lines) between a water source (like a pond or river) and the fire. They would then pass buckets back and forth, sending full buckets toward the fire and empty buckets back to the water source. You can consider this a batch process: each bucket of water represents a batch of work to be done. The batch size could increase with a bigger bucket (but the speed might slow as it becomes heavier), or the fire might be extinguished more quickly if the batch frequency increased (faster passing of buckets, more buckets, etc.). Now if we consider a more modern approach, we use the fire hose. Assuming a constant water source, we can maintain the flow from the source to the fire until it is extinguished. We also likely need fewer people involved in this process compared to the bucket brigade. But, looking only at the source, we may not know when the water will run out. It also requires a bit more planning to enable this option (providing fire hydrants, fire trucks, etc.).

4. How to determine the best approach?

We'll discuss a lot more of this in the coming chapters, but let's talk about some quick tips for which method to choose. The choice depends entirely on our requirements, so it works best if we understand the needs of what we're trying to do. If we can process the data in groups, batching is often best due to its simplicity. If we can pause our process, or scale our processes to keep up, batching is a well-understood data processing methodology. If we need to maintain order and it's still okay to pause, we should use a queue. Finally, if we need continuous data processing, or we don't know how much data needs processing, we should try streaming. Also, if we can't stop until the data is processed, streaming is our best option.
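The quick tips above can be sketched as a small decision helper. This is only an illustration of the rules of thumb from this section, not a definitive selection procedure; the function name `choose_approach` and its parameters are assumptions made up for the example.

```python
def choose_approach(continuous_required=False, data_size_known=True,
                    can_pause=True, order_required=False):
    """Return 'streaming', 'queue', or 'batch' using this section's rules of thumb."""
    # Continuous processing, an unknown amount of data, or a process that
    # cannot pause all point toward streaming.
    if continuous_required or not data_size_known or not can_pause:
        return "streaming"
    # Pausing is fine, but items must be handled in arrival order.
    if order_required:
        return "queue"
    # Otherwise data can be grouped, and batching is the simplest option.
    return "batch"


nightly_report = choose_approach()
order_tickets = choose_approach(order_required=True)
live_sensor_feed = choose_approach(continuous_required=True, data_size_known=False)
```

Real requirements rarely reduce to four flags, so treat the result as a starting point for discussion rather than a final answer.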

5. Let's practice!

We've introduced a lot of information in this chapter. Let's review in the exercises ahead and we'll see you back in chapter 3 to discuss real-time streaming, scaling streaming systems, and dealing with potential issues.