Quick pipeline
Before you parse some more complex data, your manager would like to see a simple pipeline example covering the basic steps. For this example, you'll ingest a data file, filter a few rows, add an ID column, and then write it out as JSON data.
The Spark session is available as spark, and the pyspark.sql.functions library is aliased as F, as is customary.
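For context, here is a minimal sketch of the assumed environment. The exercise provides this for you, so the builder call below is only an assumption about how spark was created:

# Assumed setup, normally provided by the exercise environment
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()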
Exercise instructions
- Import the file 2015-departures.csv.gz to a DataFrame. Note the header is already defined.
- Filter the DataFrame to contain only flights with a duration over 0 minutes. Use the index of the column, not the column name (remember to use .printSchema() to see the column names and order).
- Add an ID column.
- Write the file out as a JSON document named output.json.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the data to a DataFrame
departures_df = spark.____(____, header=____)
# Remove any duration of 0
departures_df = departures_df.____(____)
# Add an ID column
departures_df = departures_df.____('id', ____)
# Write the file out to JSON format
____.write.____(____, mode='overwrite')
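For reference, one possible completion is sketched below. The duration column index (3 here) is an assumption about this file's layout; run departures_df.printSchema() first to confirm the position of the duration column in your data.

# Import the data to a DataFrame
departures_df = spark.read.csv('2015-departures.csv.gz', header=True)

# Remove any duration of 0 (index 3 is assumed; verify with .printSchema())
departures_df = departures_df.filter(departures_df[3] > 0)

# Add an ID column of unique, increasing integers
departures_df = departures_df.withColumn('id', F.monotonically_increasing_id())

# Write the file out to JSON format
departures_df.write.json('output.json', mode='overwrite')

Note that F.monotonically_increasing_id() guarantees unique, increasing IDs, but not consecutive ones, since the values are generated independently per partition.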