1. Building ETL and ELT Pipelines
Welcome back! We touched briefly on building and running ETL and ELT pipelines. Now, let's take a closer look at what actually goes into this process!
2. Extract Data from a CSV File
First, we need to be able to pull data from a CSV file. To do this, we can use the pandas library. Once we import pandas as pd, we'll call the read_csv function to extract the data from a file into a DataFrame. We must provide a file path to read_csv, but there are a number of other optional parameters we can use, such as delimiter, header, and engine. These options are useful for files that are formatted a little differently.
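As a quick sketch (the file names here are just placeholders):

```python
import pandas as pd

# Extract data from a CSV file into a DataFrame
raw_df = pd.read_csv("raw_sales_data.csv")

# Optional parameters help with differently formatted files,
# for example a pipe-delimited file with no header row
other_df = pd.read_csv("raw_sales_data.txt", delimiter="|", header=None)
```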
If we'd like to view the first few rows of a DataFrame, we can do this using the head method. By default, it will output the first five rows.
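Continuing the sketch above, previewing a DataFrame is a one-liner:

```python
# Preview the first five rows of the extracted DataFrame
print(raw_df.head())

# Pass a number to view a different count of rows
print(raw_df.head(10))
```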
3. Filtering a DataFrame
Once we've extracted data into a DataFrame, we can begin to filter it. On the left is a DataFrame immediately after it's been extracted from a CSV file. We'd like to transform it to look like the DataFrame on the right. To do this, we can use loc.
loc is called on a DataFrame and allows its rows and columns to be filtered. Here, we're first filtering the DataFrame to only include rows where "name" is "Apparel". After the comma is a colon, which denotes that all columns should be kept.
Next, the DataFrame is filtered to only include the columns "name" and "num_firms". We'll touch more on loc later in the course.
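Here's a rough sketch of those two steps; the column names "name" and "num_firms" come from the example DataFrame, and raw_df continues the sketch above:

```python
# Keep rows where "name" is "Apparel"; the colon after the comma keeps all columns
apparel_df = raw_df.loc[raw_df["name"] == "Apparel", :]

# Then keep only the "name" and "num_firms" columns
apparel_df = apparel_df.loc[:, ["name", "num_firms"]]
```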
4. Write a DataFrame to a CSV File
To load data to a file, we can use the to_csv method. We'll call this method on a DataFrame and pass in a file path. Then, the DataFrame will be stored at the file path we provided.
Like with read_csv, there are a number of other parameters that can be used with the to_csv method to customize how the file is written.
Along with to_csv, other options like to_json, to_excel, and to_sql can be used to load data.
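For instance, loading the filtered DataFrame from the sketch above might look like this (the file path is just an example):

```python
# Write the transformed DataFrame to a CSV file
apparel_df.to_csv("cleaned_data.csv")

# Optional parameters customize how the file is written,
# for example dropping the index and using a pipe separator
apparel_df.to_csv("cleaned_data.csv", index=False, sep="|")
```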
5. Running SQL Queries
Sometimes, we'll want to use SQL to transform data in a data warehouse. To do this, we can write SQL queries using multi-line strings, like this. Then, we can use a Python client, such as SQLAlchemy or the Snowflake Python Connector, to execute queries. Here, we use the execute method to run a SQL query and create a new table.
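Here's a minimal sketch using SQLAlchemy; the connection string and table names are placeholders, and the Snowflake Python Connector's API differs slightly:

```python
import sqlalchemy

# Create an engine pointed at the data warehouse (placeholder connection string)
engine = sqlalchemy.create_engine("postgresql://user:password@host:5432/warehouse")

# Write the transformation as a multi-line string
query = """
CREATE TABLE clean_sectors AS
SELECT name, num_firms
FROM raw_sectors
WHERE name = 'Apparel';
"""

# Use the execute method to run the query and create a new table
with engine.connect() as connection:
    connection.execute(sqlalchemy.text(query))
    connection.commit()
```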
6. Putting it all together!
Putting it all together looks something like this. We'll start by adding our ETL logic to extract, transform, and load functions. Here's what a transform function might look like. Then, we'll call each of these functions. First, we'll extract data from a file using the extract function. Then, we'll transform this DataFrame to only include rows with the name "Apparel". Finally, we'll load the transformed DataFrame to a file called "cleaned_data.csv". Throughout the course, this is how almost all of our ETL and ELT pipelines will be written!
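As a sketch of that structure (the input file name and the transformation logic follow the running example, so adapt them to your own data):

```python
import pandas as pd

def extract(file_path):
    # Pull data from a CSV file into a DataFrame
    return pd.read_csv(file_path)

def transform(raw_df):
    # Keep only "Apparel" rows and the "name" and "num_firms" columns
    return raw_df.loc[raw_df["name"] == "Apparel", ["name", "num_firms"]]

def load(clean_df, file_path):
    # Write the transformed DataFrame to a CSV file
    clean_df.to_csv(file_path)

# Run the pipeline: extract, transform, then load
extracted_df = extract("raw_data.csv")
transformed_df = transform(extracted_df)
load(transformed_df, "cleaned_data.csv")
```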
7. Let's practice!
Now that you've got a few new tools in your tool belt, let's give them a try!