1. Popular streaming systems
Welcome to the final chapter of the course! In this chapter we're going to cover real-world use cases of streaming data processing and discuss some common implementations. Before we go deeper into the use cases, let's look at some common tools and frameworks for streaming systems.
2. Streaming tools
As mentioned, a significant number of data processing tools can be used in various styles of streaming systems, covering many different needs.
This allows system designers to select the best tool for the task at hand.
Some common systems include:
The Celery project
Apache Kafka
Spark Streaming
Let's take a look at each of these now.
3. Celery
First, let's take a look at the Celery project. Celery is designed as a distributed task queue (FIFO), meaning it can run on a single system or across multiple systems as required.
It is used primarily as a job or task queue, containing a set of tasks to be completed.
It often works best for asynchronous tasks: items that need to happen soon, but that aren't best handled by a web server or other tool.
This could include sending password reset emails,
fulfilling ebook orders,
resizing images, and so forth.
Celery is designed to permit real-time processing of large numbers of messages.
It also provides the ability to manage and scale queues according to need; the scaling can be done both vertically and horizontally as required.
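To make this concrete, here's a minimal sketch of a Celery task. The Redis broker URL and the send_reset_email task are illustrative assumptions, not fixed requirements; Celery also supports other brokers such as RabbitMQ.

```python
# A minimal Celery sketch; assumes a Redis broker running locally.
# The app name "tasks" and the send_reset_email task are hypothetical.
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def send_reset_email(user_email):
    # In a real system this would render and send the email;
    # here we simply return a confirmation string.
    return f"Password reset email queued for {user_email}"

# Enqueue the task asynchronously; a separate worker process
# (started with `celery -A tasks worker`) picks it up from the queue.
send_reset_email.delay("user@example.com")
```

The calling code returns immediately after delay(), which is exactly what makes Celery a good fit for work you don't want blocking a web request.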
4. Apache Kafka
Apache Kafka is another tool that is commonly used for streaming data. Kafka is a distributed event streaming system, meaning it is scalable from a single system to many hundreds or even thousands of systems.
Kafka is primarily designed to send events between producers and consumers.
In Kafka, a producer is any process or service that creates events to be shared, publishing each event to a named topic.
A topic is primarily an agreed-upon message format, which simply means that each topic contains only one type of event.
The events are shared with consumers - components that receive the events and handle them accordingly.
Not every consumer needs to handle the events in the same fashion - there could be one consumer that logs events, one that performs some type of data transformation, another that relays events to an external system, and so on.
Kafka is designed to store events for as long as the system is capable, based on storage capacity, retention requirements, and so forth.
Kafka is extremely powerful and can work in many situations, but it can be tricky to set up.
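Here's a minimal producer/consumer sketch using the kafka-python package. The broker address and the "orders" topic are assumptions for illustration; any running Kafka cluster and topic would work.

```python
# A minimal Kafka sketch using the kafka-python package.
# Assumes a broker at localhost:9092 and a hypothetical "orders" topic.
from kafka import KafkaProducer, KafkaConsumer

# Producer: creates events and publishes them to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", value=b'{"order_id": 1, "item": "ebook"}')
producer.flush()

# Consumer: subscribes to the same topic and handles events as they arrive.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # e.g., log the event or hand it off for transformation
```

Because Kafka retains events, several independent consumers - a logger, a transformer, a relay - can each read the same "orders" topic at their own pace.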
5. Kafka applications
Kafka is an interesting platform that enables some advanced use cases. We'll briefly mention these here, but note that each of these topics is rather complex.
Kafka is best used for sending data between multiple source and destination systems. The use cases vary, but can include
single source of truth scenarios,
change data capture, or CDC,
data backups,
and data system migrations.
Don't worry about what each of these applications specifically means - just know that they are often used when architecting complex data systems.
6. Spark Streaming
Spark Streaming is a component of Apache Spark,
and is designed to process streaming data.
Spark Streaming builds upon the capabilities of Spark to process data with Scala, Python, SQL, and other languages.
Spark is useful for processing large amounts of data and for various machine learning scenarios.
Spark also makes it straightforward to transition from batch to stream processing, since both use the same underlying processing framework.
Note that unlike Celery and Kafka, Spark Streaming is not designed to store or log events directly, but to process or transform data as it arrives.
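As a final sketch, here's a minimal streaming job using PySpark's Structured Streaming API. The socket source on localhost:9999 is a toy assumption commonly used for testing; real jobs would typically read from Kafka, files, or similar sources.

```python
# A minimal Structured Streaming sketch in PySpark.
# Assumes a text socket source on localhost:9999 (a common toy setup).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamSketch").getOrCreate()

# Read a stream of text lines as they arrive on the socket.
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Transform the stream: keep a running count of each distinct line.
counts = lines.groupBy("value").count()

# Write the updated counts to the console after each micro-batch.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```

Note how the transformation (groupBy/count) is ordinary Spark DataFrame code - the same logic would run in a batch job, which is what makes the batch-to-stream transition so smooth.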
7. Let's practice!
You've learned about various data streaming tools in this lesson - let's practice in the exercises ahead.