
Analyzing data in the stream

1. Analyzing data in the stream

Welcome back! Now, we're diving deeper into streams!

2. Last lesson

In the last lesson, we looked at how to transform data in a Firehose stream using AWS Lambda.

3. This lesson

In this lesson, we will learn how to analyze data as it moves through the stream using a Kinesis Data Analytics application.

4. Kinesis Data Analytics

A Kinesis Data Analytics application accepts at most one source - a Firehose delivery stream or a Kinesis Data Stream. We can optionally enrich this data using a static reference file from S3.

5. Kinesis Data Analytics

It executes SQL operations on the streaming data, enriching it, aggregating it, or analyzing it.

6. Kinesis Data Analytics

Finally, the output gets sent to one, two, or three destinations, such as a Firehose stream or a Lambda function.

7. Why Kinesis Data Analytics

Wait. So once more, another way to manipulate data in a stream? Why? Remember - Firehose streams have a minimum buffer size of one megabyte and a minimum buffer interval of sixty seconds. If we were to use a transformational Lambda, we could only look at that fixed size or time window. Kinesis Data Analytics gives us more flexibility, and also saves us from having to write a heavier, analysis-focused Lambda function within a stream.

8. Kinesis Data Analytics vs transformation Lambdas

Just like many other things in data engineering - there's no clear-cut winner, and you have to evaluate the tradeoffs. In a Lambda function, we use Python. In a Kinesis Data Analytics application, we use SQL! Both let us filter data and perform aggregation over a certain window. However, with Kinesis Data Analytics, we can specify the window ourselves. We can also join reference data to the stream, and look at the metrics of the stream over different time frames, as opposed to the interval enforced by Firehose. With Kinesis Data Analytics, we can combine multiple streams. Finally, with Kinesis Data Analytics, we can send the output to other destinations!

9. Kinesis Data Analytics SQL

Let's do a light skim of Kinesis Data Analytics SQL, which uses the SQL:2008 standard. SOURCE_SQL_STREAM_001, 002, and so on - depending on how many streams you have - represent your source streaming data in Kinesis. The DESTINATION_SQL_STREAM represents the stream that results from your SQL processing. It gets sent to the Kinesis Data Analytics application destination. Finally, the stream pump is a continuously running insert query that moves data from one in-application stream to another in-application stream.

10. Kinesis Data Analytics SQL

The general SQL flow is something like this. You create the destination SQL stream; you create the pump for the continuous insert to run. Then, you pump the results of your query to the destination stream.
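As a rough sketch, the three steps might look like this in Kinesis Data Analytics SQL - the stream and pump names follow the service's conventions, but the columns (a "vehicle_id" and a "speed") are assumptions for illustration:

```sql
-- Step 1: create the in-application destination stream (assumed schema)
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "vehicle_id" VARCHAR(16),
    "speed"      DOUBLE
);

-- Step 2: create the pump - a continuous insert query
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
-- Step 3: pump the query results into the destination stream
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM "vehicle_id", "speed"
FROM "SOURCE_SQL_STREAM_001"
WHERE "speed" > 0;
```

Note the SELECT STREAM keyword: unlike a one-shot SELECT on a table, it runs continuously against rows as they arrive in the in-application stream.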

11. Some options

Using this pattern, you can build SQL to join multiple streams, enrich data using a join, find anomalies, continuously filter, or find the top X repeating items.
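For instance, enriching the stream with S3 reference data is just a join between the in-application source stream and the reference table. This is a sketch only - the reference table name "REFERENCE_DATA" and the columns involved are assumptions, and in a real application the table name comes from your reference data source configuration:

```sql
-- Destination stream holding the enriched rows (assumed schema)
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "sensor_id"   VARCHAR(16),
    "driver_name" VARCHAR(64)
);

-- Pump that joins each streaming row to the static S3 reference table
CREATE OR REPLACE PUMP "ENRICH_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM s."sensor_id", r."driver_name"
FROM "SOURCE_SQL_STREAM_001" AS s
JOIN "REFERENCE_DATA" AS r
  ON s."sensor_id" = r."sensor_id";
```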

12. Finding overpingers

You've been notified that some sensors on vehicles may be pinging too frequently. This causes extra network usage and may unfairly penalize some drivers for speeding.

13. Let's practice!

Let's put together an analytics application that will find sensors that ping more than twice in a ten-second interval.
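Before you try it yourself, here's one way such a query could be shaped: a tumbling ten-second window built with STEP on the row time, grouped by sensor, keeping only groups with more than two pings. Treat this as a hedged sketch - the "sensor_id" column is an assumption about the incoming schema:

```sql
-- Sensors with more than two pings per ten-second tumbling window
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "sensor_id"  VARCHAR(16),
    "ping_count" INTEGER
);

CREATE OR REPLACE PUMP "OVERPING_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM "sensor_id", COUNT(*) AS "ping_count"
FROM "SOURCE_SQL_STREAM_001"
GROUP BY "sensor_id",
         -- STEP truncates ROWTIME to ten-second buckets, giving a tumbling window
         STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND)
HAVING COUNT(*) > 2;
```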