1. What is streaming and why does it matter?
Hello hello hello! Welcome to Streaming Data with Amazon Kinesis and AWS Lambda!
My name is Maksim Pecherskiy.
In this class, we're taking a step up from AWS basics and working with real-time data. We will dive into these concepts using (almost) real-life examples from my work in City government.
Ready? Let's stream away!
2. Batch vs stream
There are two main ways of working with data - batch or stream. Neither is better or worse - they're just different tools for different jobs. A data engineer understands the business need and finds the right solution.
3. Batch vs stream
The key difference between batch and stream architectures is when you need to use the data.
Batch is excellent for larger data and complex analysis, but the data can be a bit older - hours or days. Making a daily sales report, projecting next month's performance, or finding customers at risk of leaving are examples.
Streaming is better for simpler analysis on each record or a fairly short time window. The data moves fast - seconds or milliseconds. Cases like fraud detection, monitoring wind turbines, or sending alerts in real time are a good fit.
4. Cody and the fleet
We will be helping Cody, the Sustainability Manager for the City of San Diego. As the city's Data Engineer, you will help her decrease emissions and accidents using the Internet of Things, or IoT.
She wants to put a sensor in each vehicle that will transmit emissions, time and speed for that vehicle. This is known as vehicle telematics, a common streaming data use case.
5. Telematics streaming
You will use Amazon Kinesis for ingesting data from vehicle on-board devices. Next, you will combine it with AWS Lambda Functions for processing that data.
You will also use S3, IAM and SNS from the previous class.
This might look new, scary, and complicated. It looks that way to everyone at first. But as we dive into these tools, you will get much more comfortable.
6. Amazon Kinesis
Let's dive into Kinesis. Kinesis has Data Firehose, Data Streams and Data Analytics as sub-services. You'll be combining these to solve streaming data use cases. Let's start with Firehose.
7. Data Firehose
The key component of Firehose is the Firehose Delivery Stream.
8. Delivery streams
Producers, like city vehicles, generate data.
9. Delivery streams
They write it to the Firehose Delivery Stream using boto3 client's .put_record() method (we will learn this in the next lesson).
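As a preview of what the next lesson covers, the write step can be sketched as a small helper. The function name `send_reading`, the stream name, and the payload fields here are illustrative assumptions; only the `.put_record()` call with `DeliveryStreamName` and `Record` is the actual Firehose client API.

```python
import json

def send_reading(firehose, stream_name, payload):
    """Write one telematics record to a Firehose delivery stream.

    `firehose` is a boto3 Firehose client; `payload` is a dict like
    {'vehicle_id': 'V123', 'speed': 42}. Firehose treats the record as
    raw bytes, so we serialize to JSON and add a newline separator.
    """
    return firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={'Data': json.dumps(payload) + '\n'},
    )

# Usage sketch (assumes credentials are configured):
# import boto3
# firehose = boto3.client('firehose', region_name='us-east-1')
# send_reading(firehose, 'vehicle-telematics', {'vehicle_id': 'V123', 'speed': 42})
```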
10. Delivery streams
Next, the Firehose delivery stream delivers this data to a destination for storage - like S3, Redshift, or Elasticsearch.
Let's kick the tires.
11. Creating a Firehose client
We use boto3 to interact with Firehose.
We call boto3's .client() method, passing 'firehose' as the argument. We supply the aws_access_key_id and aws_secret_access_key, and use the 'us-east-1' region.
These credentials are the same ones we created in the "Intro to Boto" class. More on permissions in the next lesson.
12. Working with delivery streams
Your coworker left some streams behind in the AWS account when she moved to another job. Let's list them by calling the firehose client's .list_delivery_streams() method.
A list of streams is available under the response object's DeliveryStreamNames key.
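As a minimal sketch, assuming a Firehose client has already been created as `firehose`, pulling the names out of the response looks like this:

```python
def list_stream_names(firehose):
    """Return the names of existing Firehose delivery streams.

    The response dict exposes the names under 'DeliveryStreamNames'.
    """
    response = firehose.list_delivery_streams()
    return response['DeliveryStreamNames']
```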
13. Delete streams
Let's delete these old streams. We iterate over each stream_name from .list_delivery_streams() and call the .delete_delivery_stream() method with the stream_name passed to the DeliveryStreamName argument.
Now you have a clean slate for creating new streams.
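The cleanup loop described above can be sketched as follows, again assuming a Firehose client has already been created as `firehose`:

```python
def delete_all_streams(firehose):
    """Delete every existing Firehose delivery stream.

    Iterates over the names returned by .list_delivery_streams() and
    passes each one as the DeliveryStreamName argument.
    """
    stream_names = firehose.list_delivery_streams()['DeliveryStreamNames']
    for stream_name in stream_names:
        firehose.delete_delivery_stream(DeliveryStreamName=stream_name)
    return stream_names
```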
14. Review
In this lesson we learned the difference between batch processing and streaming data.
We met Cody and got introduced to telematics collection.
We covered the basics of Kinesis and Firehose, including how to find existing streams and delete them.
We learned key terms like "Producer" - the generator of data written to a Firehose stream - and "Destination" - where the data gets stored.
15. Let's practice!
Now let's practice working with some streams!