Streaming roadblocks

1. Streaming roadblocks

Welcome back! We're on our last lesson for this chapter after learning about scaling streaming systems. Let's talk about some of the potential challenges you might face while implementing streaming systems.

2. Scaling review

Before we continue, let's do a quick review of some of the aspects of scaling streaming processes. For vertical scaling (moving to a more powerful computer), compute resources are the primary items that need an increase. This includes CPU, RAM, disk, and network capacity. For horizontal scaling (adding more computers), we're primarily adding more systems as nodes / workers.

3. Initial concerns

Let's discuss some initial concerns regarding scaling for streaming processes. Compute resources need to be available, both in capacity and performance. Missing or slow resources can cause drastic issues with stream processing. When working with more nodes, we have other issues to be concerned about. Each machine needs additional connectivity to communicate with the others. Multiple nodes also typically require some form of shared resource where the data being processed can be read and saved. Finally, multiple nodes can add significant complexity, which often requires some form of cluster or workload management to properly utilize the machines.

4. Communication issues

In addition to resource issues with streaming data, there can also be several common problems while processing data. These can include missing messages, delayed messages, out of order messages, and finally repeat messages. Let's look at each of these individually. Please note that a message in this instance could be any communication detail, but for this scenario, let's consider these to be the instructions in a recipe.

5. Missing messages

The first challenge that we often see in streaming processes is missing messages, or missing events. These are events that never appear in our stream, usually because of a resource issue such as lost network connectivity. They can be extremely difficult to detect. One way to handle this type of problem is to include a sequence number with each message, so that any gap in the sequence can be detected and the missing messages re-requested or re-sent. Unfortunately, this can also delay the streaming pipeline, because waiting for the original message holds up processing of new events.
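To make the sequence number idea concrete, here is a minimal Python sketch. It assumes each recipe step carries an integer sequence number that increases by one; the function name `find_missing` is hypothetical and only for illustration.

```python
def find_missing(sequence_numbers):
    """Return the sequence numbers skipped in a stream of arrivals."""
    missing = []
    expected = None  # next sequence number we expect to see
    for seq in sequence_numbers:
        if expected is not None and seq > expected:
            # Everything between the expected number and this one was skipped.
            missing.extend(range(expected, seq))
        expected = seq + 1 if expected is None else max(expected, seq + 1)
    return missing


# Recipe steps 4 and 5 never arrived.
arrived = [1, 2, 3, 6, 7]
print(find_missing(arrived))
```

The receiver could then ask the sender to re-send the listed steps, at the cost of pausing the steps that come after the gap.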

6. Delayed messages

Delayed messages are another common issue in streaming processes. These are similar to missing messages, except that they eventually arrive, just later than expected (say two seconds instead of one hundred milliseconds). This delay can cause issues in the processing pipeline, effectively limiting the overall throughput of the system. Much like missing messages, delays are often related to resource issues on the upstream (that is, sending) systems.
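One way to spot delayed messages is to compare when each message was sent with when it arrived. The sketch below assumes every message carries both timestamps and that the sender and receiver clocks are reasonably synchronized; the `Message` class, the `find_delayed` function, and the 100 millisecond threshold are illustrative choices, not part of any particular streaming library.

```python
from dataclasses import dataclass


@dataclass
class Message:
    seq: int
    sent_at: float      # seconds, stamped by the sender
    received_at: float  # seconds, stamped by the receiver


def find_delayed(messages, max_latency=0.1):
    """Return sequence numbers of messages slower than max_latency seconds."""
    return [m.seq for m in messages if m.received_at - m.sent_at > max_latency]


messages = [
    Message(seq=1, sent_at=0.0, received_at=0.05),  # on time
    Message(seq=2, sent_at=1.0, received_at=3.0),   # two seconds late
    Message(seq=3, sent_at=2.0, received_at=2.08),  # on time
]
print(find_delayed(messages))
```

Tracking which messages exceed the threshold, and how often, can help point to the upstream system that is short on resources.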

7. Out of order messages

Out of order messages are another type of problem in streaming data systems. These are effectively a combination of missing and delayed messages. The problem arises when an older message (or messages) appears after newer events. Detecting it requires some kind of sequence number or state tracking. The proper handling method depends on the type of data process being used. For audio or video applications, it may be best to drop the out of order message. For event log or order style applications, the pipeline often needs to reorder the messages to verify that the data is correct.
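For the reordering case, a small buffer can hold early arrivals until the steps before them show up. This is a minimal sketch, assuming integer sequence numbers that start at 1 and increase by one; the `ReorderBuffer` class is hypothetical and keeps everything in memory.

```python
class ReorderBuffer:
    """Release messages strictly in sequence order, holding back early ones."""

    def __init__(self, first_seq=1):
        self.next_seq = first_seq  # next sequence number to release
        self.pending = {}          # early arrivals, keyed by sequence number

    def push(self, seq, payload):
        """Accept one message and return the payloads that are now ready."""
        if seq < self.next_seq or seq in self.pending:
            return []  # already released or already buffered
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


buffer = ReorderBuffer()
arrivals = [(1, "preheat oven"), (3, "add eggs"), (2, "mix flour"), (4, "bake")]
in_order = []
for seq, step in arrivals:
    in_order.extend(buffer.push(seq, step))
print(in_order)
```

An audio or video pipeline would more likely skip the buffer and simply discard any message whose sequence number is lower than the latest one already played.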

8. Repeat messages

The last issue we'll consider is repeat messages. This occurs when the same message is sent multiple times, often due to system issues (restarting a process, network issues, etc). Completely avoiding repeat messages requires sequence handling, but depending on the data process, it might be safe to ignore them. Note that a repeated value is sometimes not a problem at all but expected behavior, such as a temperature measurement that reports the same temperature each minute.
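A simple form of sequence handling for repeats is to remember which sequence numbers have already been processed. The sketch below assumes each message is a `(sequence number, payload)` pair; the `drop_repeats` function is hypothetical, and the set of seen numbers grows without bound, so a long-running stream would need to trim it.

```python
def drop_repeats(messages):
    """Keep only the first message seen for each sequence number."""
    seen = set()
    unique = []
    for seq, payload in messages:
        if seq in seen:
            continue  # same message sent again, skip it
        seen.add(seq)
        unique.append((seq, payload))
    return unique


readings = [
    (1, "21.5 C"),
    (2, "21.5 C"),  # same temperature, new measurement: kept
    (2, "21.5 C"),  # same sequence number: a repeat, dropped
    (3, "21.7 C"),
]
print(drop_repeats(readings))
```

Because the check uses the sequence number rather than the payload, two separate measurements of the same temperature are both kept.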

9. Let's practice!

We've looked at several possible issues involved in streaming data processes. Let's practice what you've learned in the exercises ahead.