
Building ETL and ELT Pipelines

1. Building ETL and ELT Pipelines

Welcome back! We touched briefly on building and running ETL and ELT pipelines. Now, let's take a closer look at what actually goes into this process!

2. Extract Data from a CSV File

First, we need to be able to pull data from a CSV file. To do this, we can use the pandas library. Once we import pandas as pd, we'll call the read_csv function to extract the data from a file into a DataFrame. We must provide a file path to read_csv, but there are a number of other optional parameters we can use, such as delimiter, header, and engine. These options are useful for files that are formatted a little differently. If we'd like to view the first few rows of a DataFrame, we can use the head method. By default, it outputs the first five rows.
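Here's a minimal sketch of that extraction step; the file name "raw_data.csv" is just a placeholder.

```python
import pandas as pd

# Extract data from a CSV file into a DataFrame ("raw_data.csv" is hypothetical)
data_frame = pd.read_csv("raw_data.csv")

# Optional parameters help with differently-formatted files, for example
# a tab-delimited file with no header row:
# data_frame = pd.read_csv("raw_data.csv", delimiter="\t", header=None, engine="python")

# Preview the first five rows of the DataFrame
print(data_frame.head())
```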

3. Filtering a DataFrame

Once we've extracted data into a DataFrame, we can begin to filter it. On the left is a DataFrame immediately after it's been extracted from a CSV file. We'd like to transform it to look like the DataFrame on the right. To do this, we can use loc. loc is called on a DataFrame, and allows its rows and columns to be filtered. Here, we first filter the DataFrame to include only rows where the "name" column equals "Apparel". After the comma is a colon, which denotes that all columns should be kept. Next, the DataFrame is filtered to include only the columns "name" and "num_firms". We'll touch more on loc later in the course.
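A short sketch of both loc steps, using a small made-up DataFrame in place of the one shown on the slide:

```python
import pandas as pd

# A small example DataFrame standing in for the extracted data
raw_df = pd.DataFrame({
    "name": ["Apparel", "Tech", "Apparel"],
    "num_firms": [12, 30, 8],
    "region": ["US", "EU", "EU"],
})

# First, keep only rows where "name" is "Apparel"; the colon keeps all columns
apparel = raw_df.loc[raw_df["name"] == "Apparel", :]

# Next, keep only the "name" and "num_firms" columns
apparel = apparel.loc[:, ["name", "num_firms"]]
print(apparel)
```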

4. Write a DataFrame to a CSV File

To load data to a file, we can use the to_csv method. We'll call this method on a DataFrame, and pass in a file path. Then, the DataFrame will be stored at the file path we provided. As with read_csv, there are a number of other parameters that can be used with the to_csv method to customize how the file is written. Along with to_csv, other options like to_json, to_excel, and to_sql can be used to load data.
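For example, loading the filtered DataFrame from above might look like this; the output path is hypothetical.

```python
import pandas as pd

data_frame = pd.DataFrame({"name": ["Apparel"], "num_firms": [12]})

# Write the DataFrame to the provided (hypothetical) file path
data_frame.to_csv("cleaned_data.csv")

# Optional parameters customize the output, e.g. omitting the index column:
# data_frame.to_csv("cleaned_data.csv", index=False)
```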

5. Running SQL Queries

Sometimes, we'll want to use SQL to transform data in a data warehouse. To do this, we can write SQL queries using multi-line strings, like this. Then, we can use a Python client, like SQLAlchemy or the Snowflake Python Connector, to execute queries. Here, we use the execute method to run a SQL query and create a new table.
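As a sketch using SQLAlchemy, it might look like the following; the connection URL and table names are placeholders, not part of the course's setup.

```python
import sqlalchemy as sa

# Hypothetical connection string; swap in your warehouse's URL
engine = sa.create_engine("postgresql+psycopg2://user:password@host:5432/warehouse")

# SQL written as a multi-line string
query = """
CREATE TABLE IF NOT EXISTS clean_stock_data AS
SELECT name, num_firms
FROM raw_stock_data
WHERE name = 'Apparel';
"""

# Execute the query to create a new table; engine.begin() commits on success
with engine.begin() as connection:
    connection.execute(sa.text(query))
```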

6. Putting it all together!

Putting it all together looks something like this. We'll start by adding our ETL logic to extract, transform, and load functions. Here's what a transform function might look like. Then, we'll call each of these functions. First, we'll extract data from a file using the extract function. Then, we'll transform this DataFrame to include only rows with the name "Apparel". Finally, we'll load the transformed DataFrame to a file called "cleaned_data.csv". Throughout the course, this is how almost all of our ETL and ELT pipelines will be written!
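A sketch of the full pipeline, with a hypothetical "raw_data.csv" as the source file:

```python
import pandas as pd

def extract(file_path):
    # Pull the raw data from a CSV file into a DataFrame
    return pd.read_csv(file_path)

def transform(data_frame):
    # Keep only "Apparel" rows, then the "name" and "num_firms" columns
    return data_frame.loc[data_frame["name"] == "Apparel", ["name", "num_firms"]]

def load(data_frame, file_path):
    # Write the transformed DataFrame to the target file
    data_frame.to_csv(file_path)

# Run the pipeline end-to-end
extracted = extract("raw_data.csv")
transformed = transform(extracted)
load(transformed, "cleaned_data.csv")
```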

7. Let's practice!

Now that you've got a few new tools in your tool belt, let's give them a try!