1. Building ETL and ELT Pipelines
Welcome back! We touched briefly on building and running ETL and ELT pipelines. Now, let's take a closer look at what actually goes into this process!
2. Extract Data from a CSV File
First, we need to be able to pull data from a CSV file. To do this, we can use the pandas library. Once we import pandas as pd, we'll call the read_csv function to extract the data from a file into a DataFrame. We must provide a file path to read_csv, but there are a number of other optional parameters we can use, such as delimiter, header, and engine. These options are useful for files that are formatted a little differently.
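As a quick sketch (the file names here are just placeholders):

```python
import pandas as pd

# Extract data from a CSV file into a DataFrame
raw_df = pd.read_csv("raw_sales_data.csv")

# Optional parameters help with differently formatted files,
# for example a pipe-delimited file with no header row
other_df = pd.read_csv("raw_sales_data.txt", delimiter="|", header=None)
```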
If we'd like to view the first few rows of a DataFrame, we can do this using the head method. By default, it will output the first five rows.
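Continuing the sketch above, previewing a DataFrame is a one-liner:

```python
# Preview the first five rows of the extracted DataFrame
print(raw_df.head())

# Pass a number to view a different count of rows
print(raw_df.head(10))
```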
3. Filtering a DataFrame
Once we've extracted data into a DataFrame, we can begin to filter it. On the left is a DataFrame immediately after it's been extracted from a CSV file. We'd like to transform it to look like the DataFrame on the right. To do this, we can use loc.
loc is called on a DataFrame and allows its rows and columns to be filtered. Here, we're first filtering the DataFrame to only include rows where "name" is "Apparel". After the comma is a colon, which denotes that all columns should be kept.
Next, the DataFrame is filtered to only include the columns "name" and "num_firms". We'll touch more on loc later in the course.
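Here's a rough sketch of those two steps; the column names "name" and "num_firms" come from the example DataFrame, and raw_df continues the sketch above:

```python
# Keep rows where "name" is "Apparel"; the colon after the comma keeps all columns
apparel_df = raw_df.loc[raw_df["name"] == "Apparel", :]

# Then keep only the "name" and "num_firms" columns
apparel_df = apparel_df.loc[:, ["name", "num_firms"]]
```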
4. Write a DataFrame to a CSV File
To load data to a file, we can use the to_csv method. We'll call this method on a DataFrame and pass in a file path. Then, the DataFrame will be stored at the file path we provided.
Like with read_csv, there are a number of other parameters that can be used with the to_csv method to customize how the file is written.
Along with to_csv, other options like to_json, to_excel, and to_sql can be used to load data.
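For instance, loading the filtered DataFrame from the sketch above might look like this (the file path is just an example):

```python
# Write the transformed DataFrame to a CSV file
apparel_df.to_csv("cleaned_data.csv")

# Optional parameters customize how the file is written,
# for example dropping the index and using a pipe separator
apparel_df.to_csv("cleaned_data.csv", index=False, sep="|")
```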
5. Running SQL Queries
Sometimes, we'll want to use SQL to transform data in a data warehouse. To do this, we can write SQL queries using multi-line strings, like this. Then, we can use a Python client, such as SQLAlchemy or the Snowflake Python Connector, to execute queries. Here, we use the execute method to run a SQL query and create a new table.
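Here's a minimal sketch using SQLAlchemy; the connection string and table names are placeholders, and the Snowflake Python Connector's API differs slightly:

```python
import sqlalchemy

# Create an engine pointed at the data warehouse (placeholder connection string)
engine = sqlalchemy.create_engine("postgresql://user:password@host:5432/warehouse")

# Write the transformation as a multi-line string
query = """
CREATE TABLE clean_sectors AS
SELECT name, num_firms
FROM raw_sectors
WHERE name = 'Apparel';
"""

# Use the execute method to run the query and create a new table
with engine.connect() as connection:
    connection.execute(sqlalchemy.text(query))
    connection.commit()
```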
6. Putting it all together!
Putting it all together looks something like this. We'll start by adding our ETL logic to extract, transform, and load functions. Here's what a transform function might look like. Then, we'll call each of these functions. First, we'll extract data from a file using the extract function. Then, we'll transform this DataFrame to only include rows with the name "Apparel". Finally, we'll load the transformed DataFrame to a file called "cleaned_data.csv". Throughout the course, this is how almost all of our ETL and ELT pipelines will be written!
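As a sketch of that structure (the input file name and the transformation logic follow the running example, so adapt them to your own data):

```python
import pandas as pd

def extract(file_path):
    # Pull data from a CSV file into a DataFrame
    return pd.read_csv(file_path)

def transform(raw_df):
    # Keep only "Apparel" rows and the "name" and "num_firms" columns
    return raw_df.loc[raw_df["name"] == "Apparel", ["name", "num_firms"]]

def load(clean_df, file_path):
    # Write the transformed DataFrame to a CSV file
    clean_df.to_csv(file_path)

# Run the pipeline: extract, transform, then load
extracted_df = extract("raw_data.csv")
transformed_df = transform(extracted_df)
load(transformed_df, "cleaned_data.csv")
```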
7. Let's practice!
Now that you've got a few new tools in your tool belt, let's give them a try!