1. Extracting data from structured sources
In this video, we'll explore the first step of almost every data pipeline: extracting data from a source system. Let's get started!
2. Source systems
Almost all data pipelines begin by extracting data from a source system. In this course, we'll extract data from source systems such as CSV, Parquet, and JSON files, as well as more dynamic sources such as SQL databases.
Data can also be pulled from APIs, which are commonly used to ingest data from a third party. Within organizations, data lakes and warehouses are popular source systems for data pipelines. In more advanced pipelines, it's even common to web scrape the data. Besides these, there are tons of other sources from which data can be pulled.
In this video, we'll explore pulling data from tabular source systems, including Parquet files and SQL databases.
3. Reading in parquet files
Chances are, you've read a CSV file or two into memory using pandas. However, many other tabular file types can be used to store data. One such file type is Parquet. Apache Parquet is an open-source, column-oriented file format designed for efficient data storage and retrieval. Working with Parquet files in pandas is similar to working with CSV files, except reading and writing are typically much faster!
To read a Parquet file into a DataFrame, first, import pandas as pd. Then, pass the file path to the pd-dot-read_parquet function, and if you'd like, pass an engine, such as "fastparquet". It's that easy to ingest a Parquet file!
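As a quick sketch, the whole step might look like the code below; the file name "sales_data.parquet" is just a hypothetical placeholder, not a file from the course.

```python
import pandas as pd

# Read a Parquet file into a DataFrame, using the "fastparquet" engine
# (the file name here is a placeholder)
sales_df = pd.read_parquet("sales_data.parquet", engine="fastparquet")

# Take a peek at the first few rows
print(sales_df.head())
```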
4. Connecting to SQL databases
SQL databases are one of the most commonly used tabular data sources. You can query data from a SQL database and store it in a DataFrame, using the read_sql function in pandas.
To use the read_sql function, a connection object to the SQL database must be created. This can be done using sqlalchemy's create_engine function. create_engine takes a connection URI, which stands for "uniform resource identifier". URIs allow connection information to be formatted in a string, which can then be passed to other tools. These strings take the format: schema_identifier://username:password@host:port/db.
We'll be using the postgresql-plus-psycopg2 schema identifier, and when needed, other connection details will be provided.
The connection object returned by the create_engine function should be passed to the read_sql function, along with a query. This query can be as complex as you'd like, but should ultimately SELECT data from the database of interest. The result of the query is stored in a DataFrame, and is available in memory!
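Put together, a minimal sketch might look like this; the username, password, host, port, database name, and the "sales" table are all placeholder values, not details from the course.

```python
import pandas as pd
import sqlalchemy

# Connection URI: schema_identifier://username:password@host:port/db
# (all of these connection details are placeholders)
connection_uri = "postgresql+psycopg2://repl:password@localhost:5432/market"
db_engine = sqlalchemy.create_engine(connection_uri)

# Query the database and store the result set in a DataFrame
raw_df = pd.read_sql("SELECT * FROM sales", db_engine)
print(raw_df.head())
```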
5. Modularity
Very nice! We've explored two common ways to extract data from tabular data sources using functionality such as read_parquet and read_sql.
When we write code to build a data pipeline, it's important to make sure that it's modular in nature. Separating distinct logic into functions makes a pipeline much easier to read. When building data pipelines, it's a great idea to separate "extract", "transform" and "load" logic into separate functions.
Modularizing your code also adheres to the "don't repeat yourself" (or DRY) principle, and makes code reusable! Here, we've reformatted our logic from the last slide into a function, sketched below. Now, this code can be reused, which helps save time when developing, and can even expedite troubleshooting.
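Here's a rough sketch of what that extract function could look like, reusing the same placeholder connection details and table name as before.

```python
import pandas as pd
import sqlalchemy

def extract(connection_uri, query):
    """Extract the result of a SQL query into a DataFrame."""
    # Create a connection object from the URI
    db_engine = sqlalchemy.create_engine(connection_uri)

    # Run the query and return the result as a DataFrame
    return pd.read_sql(query, db_engine)

# Reuse the function for any source database or query
# (these connection details are placeholders)
raw_df = extract(
    "postgresql+psycopg2://repl:password@localhost:5432/market",
    "SELECT * FROM sales",
)
```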
As you build more data pipelines, modularizing your logic will become more natural, and even provides a framework when getting started on a new pipeline.
6. Let's practice!
Awesome work! In the exercises that follow, we'll finally get hands-on experience extracting data from tabular source systems. Good luck!