Testing a data pipeline end-to-end
In this exercise, you'll be working with the same data pipeline as before, which extracts, transforms, and loads tax data. You'll practice testing this pipeline end-to-end to ensure the solution can be run multiple times, without duplicating the transformed data in the parquet file.
pandas
has been loaded as pd
, and the extract()
, transform()
, and load()
functions have already been defined.
This exercise is part of the course
ETL and ELT in Python
Exercise instructions
- Run the ETL pipeline three times, using a
for
-loop. - Print the shape of the
clean_tax_data
in each iteration of the pipeline run. - Read the DataFrame stored in the
"clean_tax_data.parquet"
file into theto_validate
variable. - Output the shape of the
to_validate
DataFrame, comparing it to the shape ofclean_tax_rate
to ensure data wasn't duplicated upon each pipeline run.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Trigger the data pipeline to run three times
____ attempt in range(0, ____):
print(f"Attempt: {attempt}")
raw_tax_data = extract("raw_tax_data.csv")
clean_tax_data = transform(raw_tax_data)
load(clean_tax_data, "clean_tax_data.parquet")
# Print the shape of the cleaned_tax_data DataFrame
print(f"Shape of clean_tax_data: {clean_tax_data.____}")
# Read in the loaded data, check the shape
to_validate = pd.____("clean_tax_data.parquet")
print(f"Final shape of cleaned data: {to_validate.____}")