1. Single system data streaming
Nice work on that last set of exercises! Now let's take a look at a simple data streaming system.
2. Intro to streaming
Before we get into details, we need to discuss streaming and what it means in a data processing context.
So what is streaming?
The primary aspect of a streaming data scenario is that data is handled as it arrives: the process is not complete until each piece of data has passed through every step of the pipeline. This differs from a queuing or batching scenario, where data can be received now and processed later.
That said, once the data is initially processed, it could be part of further pipelines (batch, queue, or other streaming).
Streaming data is open-ended, meaning that we have no specific end event. This means that a stream of data could be a few bytes, 100 events, or a continuous stream of information.
The stream is defined by the flow of data, not necessarily the content. It's up to the application to define what to do with the content at any given time.
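As a rough sketch of this idea, a Python generator can stand in for an open-ended stream: it has no defined end, and the consuming application decides how much of it to process. (The event structure here is made up for illustration.)

```python
import itertools

def event_stream():
    """Simulate an open-ended stream: events keep arriving
    with no specific end event."""
    n = 0
    while True:  # no stopping condition -- the stream is open-ended
        yield {"event_id": n, "payload": f"message {n}"}
        n += 1

# The application, not the stream, decides how much to consume:
first_five = list(itertools.islice(event_stream(), 5))
for event in first_five:
    print(event["event_id"], event["payload"])
```

The stream itself only defines the flow of events; slicing off five of them is a choice the consuming code makes.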
3. Logs
To better understand streaming, let's look at a fairly simple example: the data log.
Typically a computing log stores event information, whether that's a log of all user activity on a system, the logs of backups, or even the transactions in a database.
A log could be a simple text file, a binary data store, or a system to export information to multiple clients (such as Apache Kafka).
Log structures typically store information until system resources are exhausted or until older entries are pruned. This may take a moment to sink in, but a log has no defined end, nor is it driven purely by time (i.e., one message every 2 seconds). Logs used in this way are a simple implementation of stream processing.
Note that the purpose of the log depends entirely on the application usage.
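To make the log idea concrete, here is a minimal, hypothetical sketch of an append-only log: entries accumulate with no defined end, readers consume from any offset, and old entries can be pruned when resources run low. (This is a toy illustration, not how Kafka or syslog are implemented.)

```python
class SimpleLog:
    """A minimal append-only log (illustrative sketch only)."""

    def __init__(self):
        self._entries = []
        self._base = 0  # offset of the first retained entry

    def append(self, message):
        """Add an entry; return its offset in the log."""
        self._entries.append(message)
        return self._base + len(self._entries) - 1

    def read_from(self, offset):
        """Read all entries from a given offset onward."""
        return self._entries[max(0, offset - self._base):]

    def prune(self, up_to):
        """Drop entries before the given offset, e.g. to free resources."""
        keep = max(0, up_to - self._base)
        self._entries = self._entries[keep:]
        self._base += keep
```

Note that appending never "finishes" the log, and pruning is a resource decision made by the application, matching the idea that a log has no defined end.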
4. System event log
Let's consider an example of data streaming within a single system - a system event log.
System event logs are present on Windows, Mac, and Linux, among other systems.
They process and store various system event information, such as logins, USB drive insertion, and so forth.
Common examples are the Windows EventLog or syslog running on Mac or Linux.
These tools have some general components, working together in order:
The listener, which accepts messages from the various processes running on the system.
The parser, which understands how to read the information from the other processes.
The logic, which defines what to do with the data, whether to add a timestamp, and so forth.
Finally, the writer component, which stores the information in a format for later review.
Note that this is a fairly general description for these tools - the details are a bit more complex but are outside the scope of this course.
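The components above can be sketched as a small pipeline. This is only an illustrative outline under simplified assumptions (the function names and the JSON message format are made up, and real tools like syslog are considerably more complex):

```python
import json
from datetime import datetime, timezone

def parse(raw):
    """Parser: read the information sent by other processes
    (here we assume messages arrive as JSON strings)."""
    return json.loads(raw)

def apply_logic(event):
    """Logic: decide what to do with the data --
    for example, add a timestamp."""
    event["logged_at"] = datetime.now(timezone.utc).isoformat()
    return event

def write(event, destination):
    """Writer: store the event for review later."""
    destination.append(event)

def listen(raw_messages, destination):
    """Listener: accept messages and push each one
    through the rest of the pipeline."""
    for raw in raw_messages:
        write(apply_logic(parse(raw)), destination)

# Example: two "system events" arriving as raw messages
log_store = []
listen(['{"event": "login", "user": "alice"}',
        '{"event": "usb_insert", "device": "sda1"}'],
       log_store)
```

Each message flows through listener, parser, logic, and writer in order, which mirrors how the components of a system event log work together.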
5. Let's practice!
Single system data streaming is a fairly straightforward concept, but it can take a bit of repetition to be comfortable with. Let's practice some of the details in the exercises ahead.