Testing a data pipeline end-to-end
In this exercise, you'll be working with the same data pipeline as before, which extracts, transforms, and loads tax data. You'll practice testing this pipeline end-to-end to ensure the solution can be run multiple times, without duplicating the transformed data in the parquet file.
pandas has been loaded as pd, and the extract(), transform(), and load() functions have already been defined.
This exercise is part of the course ETL and ELT in Python.
Exercise instructions
- Run the ETL pipeline three times, using a for-loop.
- Print the shape of the clean_tax_data DataFrame in each iteration of the pipeline run.
- Read the DataFrame stored in the "clean_tax_data.parquet" file into the to_validate variable.
- Output the shape of the to_validate DataFrame, comparing it to the shape of clean_tax_data to ensure data wasn't duplicated upon each pipeline run.
Hands-on interactive exercise
Complete this exercise by finishing this sample code.
# Trigger the data pipeline to run three times
____ attempt in range(0, ____):
    print(f"Attempt: {attempt}")
    raw_tax_data = extract("raw_tax_data.csv")
    clean_tax_data = transform(raw_tax_data)
    load(clean_tax_data, "clean_tax_data.parquet")

    # Print the shape of the clean_tax_data DataFrame
    print(f"Shape of clean_tax_data: {clean_tax_data.shape if False else clean_tax_data.____}")
# Read in the loaded data, check the shape
to_validate = pd.____("clean_tax_data.parquet")
print(f"Final shape of cleaned data: {to_validate.____}")
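
The idea behind this check can be tried outside the course environment with a minimal sketch like the one below. It is not the course's implementation: the extract(), transform(), and load() bodies here are hypothetical stand-ins, the sample data is made up, and the output is written as CSV rather than parquet so the sketch runs without a parquet engine installed. It assumes that load() overwrites its target file on every run, which is what keeps repeated runs from duplicating rows.

```python
import os
import tempfile

import pandas as pd


def extract(path):
    # Hypothetical stand-in: read the raw CSV into a DataFrame
    return pd.read_csv(path)


def transform(raw):
    # Hypothetical stand-in: keep only rows with a positive tax rate
    return raw.loc[raw["tax_rate"] > 0].reset_index(drop=True)


def load(clean, path):
    # Assumption: overwrite the target on every run (mode="w"),
    # so rerunning the pipeline never appends to earlier output
    clean.to_csv(path, index=False, mode="w")


def run_and_validate(workdir, runs=3):
    raw_path = os.path.join(workdir, "raw_tax_data.csv")
    clean_path = os.path.join(workdir, "clean_tax_data.csv")

    # Made-up sample data, for illustration only
    pd.DataFrame(
        {"industry": ["a", "b", "c", "d"], "tax_rate": [0.10, 0.0, 0.25, 0.30]}
    ).to_csv(raw_path, index=False)

    # Run the whole pipeline several times against the same target file
    for attempt in range(runs):
        clean = transform(extract(raw_path))
        load(clean, clean_path)

    # Read back what was stored and return both shapes for comparison
    to_validate = pd.read_csv(clean_path)
    return clean.shape, to_validate.shape


with tempfile.TemporaryDirectory() as workdir:
    expected_shape, stored_shape = run_and_validate(workdir)

# The stored file should hold exactly one run's worth of rows
assert stored_shape == expected_shape
print(expected_shape, stored_shape)  # (3, 2) (3, 2)
```

If load() appended instead (for example with mode="a"), the stored file would grow with each run and the assertion would fail, which is the kind of duplication this end-to-end test is meant to catch.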