This first chapter introduces you to the world of data engineering! Explore the differences between a data engineer and a data scientist, get an overview of the tools data engineers use, and learn what role cloud technology plays in data engineering.

What is data engineering?

Tasks of the data engineer

Data engineer or data scientist?

Data engineering problems

Tools of the data engineer

Kinds of databases

Processing tasks

Scheduling tools

Cloud providers

Why cloud computing?

Big players in cloud computing

Cloud services

Introduction to Data Engineering

Now that you know the primary differences between a data engineer and a data scientist, get ready to explore the data engineer's toolbox! Learn in detail about different types of databases data engineers use, how parallel computing is a cornerstone of the data engineer's toolkit, and how to schedule data processing jobs using scheduling frameworks.

Databases

Associations

NoSQL

SQL vs NoSQL

The database schema

Joining on relations

Star schema diagram

What is parallel computing

Why parallel computing?

From task to subtasks

Using a DataFrame

Parallel computation frameworks

Cards

Hadoop

PySpark

Hive

Spark, Hadoop and Hive

A PySpark groupby

Running PySpark files

Workflow scheduling frameworks

Airflow, Luigi and cron

Airflow DAGs

Data engineering toolbox

Having been exposed to the toolbox of data engineers, it's now time to jump into the bread and butter of a data engineer's workflow! With ETL, you will learn how to extract raw data from various sources, transform this raw data into actionable insights, and load it into relevant databases ready for consumption! 

Extract

Data sources

Fetch from an API

Read from a database

Transform

Splitting the rental rate

Prepare for transformations

Joining with ratings

Loading

OLAP or OLTP

Writing to a file

Load into Postgres

Putting it all together

Defining a DAG

Setting up Airflow

Interpreting the DAG

Extract, Transform and Load (ETL)

Cap off all that you've learned in the previous three chapters by completing a real-world data engineering use case from DataCamp! You will perform and schedule an ETL process that transforms raw course rating data into actionable course recommendations for DataCamp students!

Course ratings

Exploring the schema

Querying the table

Average rating per course

From ratings to recommendations

Filter out corrupt data

Using the recommender transformation

Scheduling daily jobs

The target table

Defining the DAG

Enable the DAG

Querying the recommendations

Congratulations

Case Study: DataCamp

datacamp_application.sql

Have you heard people talk about data engineers and wondered what they do? Do you know what data engineers do, but aren't sure how to become one yourself? This course is the perfect introduction. It touches on everything you need to know to streamline your data processing and gives you enough context to start exploring the world of data engineering. It's ideal for people who work at a company with several data sources and don't have a clear idea of how to use all of them in a scalable way. Be the first to introduce these techniques to your company and become its star employee.

<h2>Get Started in Data Engineering</h2> 
Are you curious about a career in data engineering but don’t know where to start? Or perhaps you want more information on what data engineers do before you take the next steps? This four-hour course is an introduction to data engineering and the core concepts, techniques, and tools you need to understand to do the job.
<br><br> 
<h2>Learn Data Engineering Concepts and Techniques</h2> 
You’ll start by learning the differences between a data engineer and a data scientist (and how they work together) before finding out more about the tools of the trade, specifically talking about cloud computing and parallel computing. By the end of the second chapter, you’ll understand the applications of SQL and NoSQL, using DataFrames, and why parallel computing is so important.
<br><br> 
<h2>Perform ETL in Hands-on Exercises</h2> 
The ETL process is core to a data engineer’s workflow. You will learn how data is extracted, transformed, and loaded to get it ready for analysis and generating insights. At the end of the course, you’ll put all this knowledge into practice by performing and scheduling an ETL process yourself using real-world data.
<br><br> 
Our exercises and interactive tests allow you to review and cement your new knowledge, so you’re confident discussing and applying it once you’ve received your Statement of Accomplishment.
<br><br> 
This introductory course is part of a data engineering Track, which offers you pathways to improve your understanding of data engineering and a clear set of next steps to becoming a professional data engineer.

Intermediate Python

Intermediate SQL

Take the first step to becoming a data engineer today. Learn about the world of data engineering in this four-hour course.

Learn about the world of data engineering in this short course, covering tools and topics like ETL and cloud computing. 

