
Tools of the data engineer

1. Tools of the data engineer

Hello again. Great job on the exercises! You should now have a good understanding of what it means to be a data engineer. The data engineer moves data from several sources, processes or cleans it, and finally loads it into an analytical database. They do this using several tools. This video gives an overview of how data engineers fulfill their tasks using these tools. We'll go into more detail in the second chapter.

2. Databases

First, data engineers are expert users of database systems. Roughly speaking, a database is a computer system that holds large amounts of data. You might have heard of SQL or NoSQL databases. If not, there are some excellent courses on DataCamp on these subjects. Often, applications rely on databases to provide certain functionality. For example, in an online store, a database holds product data like prices or the amount in stock. Other databases, on the other hand, hold data specifically for analyses. You'll find out more about this difference in later chapters. For now, it's essential to understand that the data engineer's task begins and ends at databases.
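
To make this concrete, here is a minimal sketch of an application database, using Python's built-in sqlite3 module. The table and column names are made up for illustration; a real online store would more likely use MySQL or PostgreSQL.

```python
import sqlite3

# Hypothetical application database for an online store.
conn = sqlite3.connect("store.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL, stock INTEGER)"
)
conn.execute("INSERT INTO products VALUES ('t-shirt', 19.99, 42)")
conn.commit()

# The application queries the database to display prices and stock levels.
for name, price, stock in conn.execute("SELECT name, price, stock FROM products"):
    print(f"{name}: ${price} ({stock} in stock)")

conn.close()
```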

3. Processing

Second, data engineers use tools that can help them quickly process data. Processing might be necessary to clean or aggregate data, or to join data from different sources. Typically, huge amounts of data have to be processed. That is where parallel processing comes into play. Instead of processing the data on one computer, data engineers use clusters of machines to process the data. Often, these tools abstract away the underlying architecture and expose a simple API.

4. Processing: an example

For example, have a look at this code. It looks a lot like simple pandas filter or count operations. Behind the scenes, however, a cluster of computers could be performing these operations using the PySpark framework. We'll get into the details of different parallel processing frameworks later, but a good data engineer understands these abstractions and knows their limitations.
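
The slide's code isn't reproduced in this transcript, but a sketch of the kind of PySpark code meant here (with hypothetical file and column names) could look like this:

```python
from pyspark.sql import SparkSession

# Spark's DataFrame API reads much like pandas, but the operations
# can run in parallel across a cluster of machines.
spark = SparkSession.builder.appName("processing_example").getOrCreate()

# Read a (hypothetical) orders dataset.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Filter and count -- one-liners, just as in pandas.
large_orders = orders.filter(orders.amount > 100)
print(large_orders.count())

spark.stop()
```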

5. Scheduling

Third, scheduling tools help to make sure data moves from one place to another at the correct time, at a specific interval. Data engineers make sure these jobs run in a timely fashion and that they run in the right order. Sometimes processing jobs need to run in a particular order to function correctly. For example, tables from two databases might need to be joined together after they are both cleaned. In the following diagram, the JoinProductOrder job needs to run after CleanProduct and CleanOrder have run.
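
The dependency in that diagram can be expressed directly in a scheduling tool. As a hedged sketch, here is how it might look as an Apache Airflow 2.x DAG (a tool we'll meet on the next slide), with placeholder task functions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def clean_product():
    ...  # placeholder: clean the Product table

def clean_order():
    ...  # placeholder: clean the Order table

def join_product_order():
    ...  # placeholder: join the two cleaned tables

with DAG(
    "product_order_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
) as dag:
    clean_product_task = PythonOperator(
        task_id="CleanProduct", python_callable=clean_product
    )
    clean_order_task = PythonOperator(
        task_id="CleanOrder", python_callable=clean_order
    )
    join_task = PythonOperator(
        task_id="JoinProductOrder", python_callable=join_product_order
    )

    # JoinProductOrder runs only after both cleaning jobs have finished.
    [clean_product_task, clean_order_task] >> join_task
```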

6. Existing tools

Luckily, all of these tools are so common that there is a lot of choice in deciding which ones to use. In this slide, I'll present a few examples of each kind of tool. Please keep in mind that this list is not exhaustive, and that some companies might choose to build their own tools in-house. Two examples of databases are MySQL and PostgreSQL. Examples of processing tools are Spark and Hive. Finally, for scheduling, we can use Apache Airflow, Oozie, or the simple Unix scheduler cron.

7. A data pipeline

To sum everything up, you can think of the data engineering pipeline through this diagram. The pipeline extracts data through connections to several databases, transforms it using a cluster computing framework like Spark, and loads it into an analytical database. Everything is scheduled to run in a specific order through a scheduling framework like Airflow. A small side note here: the sources can also be external APIs or other file formats. We'll see this in the exercises.
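
As a toy end-to-end sketch of such a pipeline, here is a single extract-transform-load step in Python using pandas and SQLAlchemy. The connection strings, table, and column names are hypothetical, and the small aggregation stands in for what a Spark job would do at scale:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection strings for a source and an analytical database.
source = create_engine("postgresql://user:password@source-db/store")
target = create_engine("postgresql://user:password@analytics-db/warehouse")

# Extract: pull a raw table from the source database.
orders = pd.read_sql("SELECT * FROM orders", source)

# Transform: a toy aggregation standing in for the cluster job.
daily_revenue = orders.groupby("order_date")["amount"].sum().reset_index()

# Load: write the result into the analytical database.
daily_revenue.to_sql("daily_revenue", target, if_exists="replace", index=False)
```

In practice, a scheduler like Airflow would run a step like this at a set interval and in the right order relative to the other jobs.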

8. Let's practice!

Enough talking, let's do some exercises!