Saving a DataFrame in Parquet format
When working with Spark, you'll often start with CSV, JSON, or other data sources. This provides a lot of flexibility for the types of data to load, but these are not optimal formats for Spark. Parquet is a columnar format, so Spark can read only the columns a query needs, and it supports predicate pushdown, where filters are applied while the data is being read rather than afterwards. This means Spark only processes the data necessary to complete the operations you define instead of reading the entire dataset, which often drastically improves performance on large datasets.
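To make the columnar idea concrete, here is a toy sketch in plain Python. It is not Spark or Parquet code, and the flight records and field names are made up for illustration. It only shows why storing each column together lets a reader touch fewer values when a query needs a single column.

```python
# Toy data: three hypothetical flight records (names and values are made up).
rows = [
    {"flight": 101, "dest": "ORD", "duration": 120},
    {"flight": 102, "dest": "LAX", "duration": 185},
    {"flight": 103, "dest": "JFK", "duration": 210},
]

# Row-oriented layout: each record is stored together, so reading one field
# still means walking past every field of every record.
row_values_touched = sum(len(record) for record in rows)

# Column-oriented layout: the values of each column are stored together.
columns = {name: [record[name] for record in rows] for name in rows[0]}

# A query that only needs "duration" reads just that column.
durations = columns["duration"]
column_values_touched = len(durations)

print("total duration:", sum(durations))
print("values touched (row layout):", row_values_touched)
print("values touched (column layout):", column_values_touched)
```

Real Parquet files add compression, encoding, and per-chunk statistics on top of this layout, so treat the sketch as an intuition aid only.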
In this exercise, you'll practice creating a new Parquet file and then processing some data from it.
The spark object and the df1 and df2 DataFrames have been set up for you.
This exercise is part of the course
Cleaning Data with PySpark
Exercise instructions
- View the row count of df1 and df2.
- Combine df1 and df2 in a new DataFrame named df3 with the union method.
- Save df3 to a parquet file named AA_DFW_ALL.parquet.
- Read the AA_DFW_ALL.parquet file and show the count.
Interactive hands-on exercise
Try to solve this exercise by completing the sample code.
# View the row count of df1 and df2
print("df1 Count: %d" % df1.____())
print("df2 Count: %d" % ____.____())
# Combine the DataFrames into one
df3 = df1.union(df2)
# Save the df3 DataFrame in Parquet format
df3.____.____('AA_DFW_ALL.parquet', mode='overwrite')
# Read the Parquet file into a new DataFrame and run a count
print(spark.read.____('AA_DFW_ALL.parquet').count())