
Streaming data case study

1. Streaming data case study

Welcome back to Chapter 4! Over the last three chapters, we have acquired a solid toolbox for consuming, analyzing, and reacting to streaming data.

2. This chapter

In this final chapter of the course, we will put together everything we have learned. We will send incoming streaming data to Firehose, store it, and visualize it. We will set alerts on the values in real time and monitor our infrastructure. Lastly, we will take a look at the different ways we can meet a set of requirements. In other words, you'll get to do everything a data engineer does: architect, store, and monitor!

3. The challenge

The communications department is interested in learning how residents feel about living in the City of San Diego. You have been asked to create a dashboard that monitors Twitter sentiment in tweets with the #sandiego hashtag.

4. Requirements

There are a few requirements. Tweets must include the #sandiego hashtag. The tweets must come in real time. Tweets need to be marked as positive or negative. The dashboard should show metrics on the last 15 minutes of tweets. If more than three negative tweets arrive in a five-minute timeframe, the communications manager should be notified. The stream should minimize data loss due to downtime. Data must persist for later analysis.
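The notification requirement maps naturally to a CloudWatch alarm that fires an SNS notification. Here is a minimal sketch, assuming our pipeline publishes a count of negative tweets as a custom metric; the metric name, namespace, and topic ARN below are hypothetical placeholders, not part of the course material:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm: more than 3 negative tweets in a 5-minute (300-second) window.
cloudwatch.put_metric_alarm(
    AlarmName="negative-tweet-spike",
    Namespace="SanDiegoTweets",          # assumed custom namespace
    MetricName="NegativeTweetCount",     # assumed custom metric
    Statistic="Sum",
    Period=300,                          # the five-minute timeframe
    EvaluationPeriods=1,
    Threshold=3,
    ComparisonOperator="GreaterThanThreshold",
    # Assumed SNS topic that notifies the communications manager
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:comms-manager"],
)
```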

5. Tweets come in real-time

The fact that we need to gather data in real time is a prime example of a streaming use case. We will use a Python script to monitor tweets and send them to a Firehose stream.
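For illustration, here is a minimal sketch of the sending side. The get_tweets() generator is a hypothetical stand-in for the Twitter-monitoring logic, and the delivery stream name is assumed:

```python
import json
import boto3

firehose = boto3.client("firehose")

def get_tweets():
    """Hypothetical stand-in for the Twitter-monitoring logic;
    yields tweet dictionaries with a 'text' field."""
    yield {"text": "Loving the weather today! #sandiego"}

for tweet in get_tweets():
    # Firehose expects bytes; newline-delimited JSON keeps records
    # separable once they land in storage.
    firehose.put_record(
        DeliveryStreamName="sandiego-tweets",  # assumed stream name
        Record={"Data": (json.dumps(tweet) + "\n").encode("utf-8")},
    )
```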

6. Enriched with sentiment

To determine the sentiment of each tweet, we will create a Transform Lambda that calls the Amazon Comprehend service, which we learned about in the prerequisite course.
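As a hedged sketch of what such a Transform Lambda could look like: the record fields follow the standard Firehose data-transformation contract, while the tweet JSON shape is an assumption carried over from the sending sketch above.

```python
import base64
import json
import boto3

comprehend = boto3.client("comprehend")

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Firehose delivers each record base64-encoded.
        tweet = json.loads(base64.b64decode(record["data"]))

        # Ask Comprehend for the sentiment (POSITIVE, NEGATIVE, ...).
        result = comprehend.detect_sentiment(
            Text=tweet["text"], LanguageCode="en"
        )
        tweet["sentiment"] = result["Sentiment"]

        # Return the enriched record, re-encoded, marked as transformed OK.
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(tweet) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```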

7. Data must persist for later analysis

To persist data for later analysis, we will use S3.
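Firehose handles the S3 delivery itself; the delivery stream just needs an S3 destination configured. A minimal sketch, assuming placeholder role and bucket ARNs:

```python
import boto3

firehose = boto3.client("firehose")

# Assumed ARNs; in practice the role needs permission to write to the bucket.
firehose.create_delivery_stream(
    DeliveryStreamName="sandiego-tweets",  # assumed stream name
    S3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-role",
        "BucketARN": "arn:aws:s3:::sandiego-tweets-archive",
    },
)
```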

8. Visualize last 15 minutes

However, visualizing the last 15 minutes of data will be difficult when it's stored in S3, since S3 is really designed as static storage. Luckily, Firehose has two other options: Redshift and Elasticsearch.

9. Redshift vs Elasticsearch

Let's take a closer look at the storage piece. Storing data in S3 for real-time analysis isn't very efficient. Besides S3, Firehose can send data to Redshift and Elasticsearch. Redshift is designed for storing large, clean tables of data with a defined schema. Elasticsearch is schema-less, which makes it good for unstructured data like logs and text; the schema is created at query time. Redshift uses SQL for queries, while Elasticsearch has its own query language. Redshift works great with BI tools like Tableau, while Elasticsearch has its own interface, Kibana.
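To make the contrast concrete, here is a hedged sketch of what "negative tweets in the last 15 minutes" could look like in each system; the table, index field, and column names are assumptions:

```python
# Redshift: plain SQL against a predefined table schema.
redshift_query = """
    SELECT COUNT(*)
    FROM tweets                          -- assumed table name
    WHERE sentiment = 'NEGATIVE'
      AND created_at > GETDATE() - INTERVAL '15 minutes'
"""

# Elasticsearch: a JSON query DSL body; structure is imposed at query time.
elasticsearch_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"sentiment": "NEGATIVE"}},
                {"range": {"created_at": {"gte": "now-15m"}}},
            ]
        }
    }
}
```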

10. Let's practice!
