Validating a data pipeline at "checkpoints"
In this exercise, you'll be working with a data pipeline that extracts tax data from a CSV file, creates a new column, filters out rows based on average taxable income, and persists the data to a parquet file.
pandas
has been loaded as pd
, and the extract()
, transform()
, and load()
functions have already been defined. You'll use these functions to validate the data pipeline at various checkpoints throughout its execution.
This exercise is part of the course
ETL and ELT in Python
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Extract and transform tax_data
raw_tax_data = extract("raw_tax_data.csv")
clean_tax_data = transform(raw_tax_data)
# Check the shape of the raw_tax_data DataFrame, compare to the clean_tax_data DataFrame
print(f"Shape of raw_tax_data: {raw_tax_data.____}")
print(f"Shape of clean_tax_data: {____}")