Extracting data from parquet files
One of the most common ways to ingest data from a source system is to read it from a file, such as a CSV file. As datasets have grown, the need for more efficient file formats has led to column-oriented file types, such as parquet files.
In this exercise, you'll practice extracting data from a parquet file.
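One practical consequence of a column-oriented layout is that a reader can load only the columns it needs. The sketch below is an illustration, not part of the exercise: the helper name `read_selected_columns` and any path or column names passed to it are hypothetical, and it assumes a parquet engine (`pyarrow` or `fastparquet`) is installed alongside pandas.

```python
import pandas as pd


def read_selected_columns(path, columns, engine="auto"):
    """Load only the named columns from a parquet file.

    `path` and `columns` are supplied by the caller; nothing here
    assumes a particular schema.
    """
    # The `columns` argument asks the parquet engine to read just these
    # columns instead of the whole file.
    return pd.read_parquet(path, columns=columns, engine=engine)
```

With a row-oriented format such as CSV, the whole file generally has to be scanned even when only a few columns are wanted.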
This is a part of the course “ETL and ELT in Python”
Exercise instructions
- Read the parquet file at the path "sales_data.parquet" into a pandas DataFrame.
- Check the data types of the DataFrame by printing them with print().
- Output the shape of the DataFrame, as well as its head.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
import pandas as pd
# Read the sales data into a DataFrame
sales_data = pd.____("____", engine="fastparquet")
# Check the data types of the columns of the DataFrame
print(sales_data.____)
# Print the shape of the DataFrame, as well as the head
print(sales_data.____)
print(sales_data.____())
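For reference, one possible way to complete the steps is sketched here. It is a sketch rather than the official solution: the function names `extract_sales_data` and `summarize` are hypothetical, and it assumes that `sales_data.parquet` exists in the working directory and that the `fastparquet` package is installed.

```python
import pandas as pd


def extract_sales_data(path="sales_data.parquet", engine="fastparquet"):
    """Read a parquet file into a pandas DataFrame."""
    # engine="fastparquet" mirrors the exercise template;
    # pandas also accepts "pyarrow" and "auto".
    return pd.read_parquet(path, engine=engine)


def summarize(df):
    """Print the column data types, the shape, and the first rows."""
    print(df.dtypes)  # one dtype per column
    print(df.shape)   # (number of rows, number of columns)
    print(df.head())  # first five rows by default


# Hypothetical usage, assuming the file is present:
# sales_data = extract_sales_data()
# summarize(sales_data)
```

Note that `dtypes` and `shape` are attributes, so they take no parentheses, while `head()` is a method call.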
This exercise is part of the course
ETL and ELT in Python
Learn to build effective, performant, and reliable data pipelines using Extract, Transform, and Load principles.
Dive into leveraging pandas to extract, transform, and load data as you build your first data pipelines. Learn how to make your ETL logic reusable, and apply logging and exception handling to your pipelines.